Abstract

This thesis develops a new statistical framework for analyzing and processing stationary non-Gaussian
signals. The proposed framework consists of a collection of mathematical techniques
for  modeling  such  signals  as  well  as  an  associated  collection  of  model-based  algorithms  for  solving 
certain  basic  signal  processing  problems.  Two  inference  problems  commonly  encountered  in  practice 
are  given  special  consideration:  (i)  identification  of  the  parameter  values  of  a  non-Gaussian  signal 
source  based  on  a  clean  observation  of  the  source  output;  and  (ii)  recovery  of  the  source  output  itself 
based  on  a  noisy  observation  and  complete  knowledge  of  the  measurement  model.  These  problems 
are  referred  to,  respectively,  as  source  identification  and  signal  estimation. 

Two  probabilistic  signal  models  are  considered.  The  first,  which  is  termed  the  ARGMIX  signal 
model,  is  a  direct  generalization  of  the  classical  autoregressive  (AR)  linear-Gaussian  model.  Under 
the  ARGMIX  model,  a  signal  is  characterized  as  the  output  of  an  AR  linear  time-invariant  (LTI) 
system driven by a noise process whose samples are independent and identically distributed according
to a Gaussian-mixture (GMIX) density, rather than a purely Gaussian density. For this model,
the  source  identification  problem  can  be  solved  efficiently  with  an  iterative  technique  designed  to 
estimate the AR parameters of the LTI system as well as the means, variances, and weighting coefficients
of the GMIX density. However, the problem of optimally estimating an ARGMIX signal in
independent  additive  noise  is  shown  to  require  a  number  of  computations  growing  exponentially  with 
the  number  of  samples  contained  in  the  observation.  For  the  signal  estimation  problem,  therefore, 
only  approximate  suboptimal  algorithms  are  proposed. 

A  second  signal  model  is  introduced  as  a  way  of  overcoming  the  computational  complexity  of  the 
ARGMIX  structure.  This  new  model  is  used  to  approximate  an  arbitrarily  complicated  stationary 
signal  by  representing  it  as  the  output  of  a  finite-state  hidden  Markov  model  (HMM).  Such  a 
representation  is  generated  by  quantizing  the  underlying  signal  dynamics,  i.e.,  by  partitioning  the 
state  space  of  the  original  signal,  assigning  each  region  within  this  partition  to  a  unique  state  of  the 
Markov  chain  in  the  HMM,  and  specifying  appropriate  state  transition  probabilities  for  this  Markov 
chain;  output  densities  are  then  assigned  to  the  HMM  states  to  complete  the  approximation.  An 
analytical  method  is  given  for  determining  the  best  HMM-based  representation  of  a  signal  when  the 
signal  density  is  precisely  known.  Computationally  efficient  algorithms  are  derived  for  performing 
both  source  identification  and  signal  estimation  based  on  this  new  finite-state  model.  For  the  signal 
estimation  problem  in  particular,  a  potentially  powerful  technique  is  proposed  for  dealing  with 
independent  additive  noise  whose  samples  may  in  general  be  both  non-Gaussian  and  temporally 
dependent.

Chapter  1

Introduction 


1.1  Subject  Matter  and  Purpose 

Many signals produced by real-world systems, whether natural or man-made, carry information
in a form that is conveniently modeled as a random but structured pattern of
fluctuations over time. A primary objective in designing a signal processor is to extract this
information accurately and efficiently, so that meaningful inferences can be made about the
signal source at a reasonable computational cost. In this thesis, we concentrate on solving
such inference problems in cases where the signals involved do not necessarily obey the classical
Gaussian probability law. The central goal of the thesis is to extend the traditional
linear-Gaussian  signal  processing  framework  by  developing  a  new  set  of  modeling  concepts 
and  estimation  techniques  that  can  be  used  to  solve  certain  basic  non-Gaussian  inference 
problems. 


1.2  Preliminary  Assumptions  and  Problem  Formulation 

1.2.1  Assumptions  on  the  Measurement  Model 

A block diagram emphasizing the main elements of a typical inference problem that we will
consider is shown in Figure 1-1. This diagram depicts a signal of interest, {Y_t}, that is
initially generated by some physical source, is then subjected to an uncertain transformation
(e.g., transmission over a noisy medium or passage through an imperfect measurement
device), and is finally converted, via an appropriately designed signal processor, into information
about the source to be used by an observer. The source signal {Y_t} might be, for
example, a telecommunications waveform, a geophysical signal, or a financial time series.
We will assume throughout our work that {Y_t} is a discrete-time, scalar-valued, stationary
random process, and that it is completely characterized by a fixed, finite-dimensional parameter
vector, which we denote by θ. Furthermore, we will assume that the probabilistic
mapping shown in Figure 1-1 takes the form of a stationary additive noise process which
is statistically independent of {Y_t}; we denote this corrupting noise by {V_t}. The above
assumptions on the signal and noise imply that the observation, {Z_t}, is also stationary and
is defined by Z_t = Y_t + V_t.

Figure 1-1: Block diagram depicting the key elements of a typical inference problem.

1.2.2  Inference  Problems  to  be  Considered 

In  this  thesis,  we  restrict  our  attention  to  two  specific  inference  problems  of  the  type  depicted 
in  Figure  1-1.  We  refer  to  these  problems  as  source  identification  and  signal  estimation.  In 
the  source  identification  problem,  it  is  assumed  that  the  noise  included  in  the  observation 
has  negligible  power;  hence,  we  take  the  sequence  {Vt}  to  be  identically  zero.  In  this 
problem, we are given a mathematical model for the source signal {Y_t} as well as a finite-length
sequence y_0, y_1, ..., y_{N-1} of uncorrupted realizations of the signal. Our goal is to
estimate the value of the signal parameter vector θ. In the signal estimation problem, we
are given mathematical models for the source signal {Y_t} and for the additive noise process
{V_t} (including the values of all model parameters), as well as a finite-length sequence of
noisy observations z_0, z_1, ..., z_{N-1}. Our goal is to estimate the underlying signal values
y_0, y_1, ..., y_{N-1}. We will give more specific estimation criteria for each of these problems
as  we  impose  additional,  more  concrete  structure  on  our  measurement  model.  In  general, 
however,  we  will  attempt  to  find  a  maximum  likelihood  (ML)  estimate  when  solving  the 
source  identification  problem  and  a  minimum  mean  squared  error  (MMSE)  estimate  when 
solving  the  signal  estimation  problem. 

1.2.3  Remarks  on  the  Problem  Formulation 

The  problems  of  source  identification  and  signal  estimation  as  defined  above  are  clearly 
idealizations  of  their  more  complicated  counterparts  arising  in  practice.  For  example,  there 
exist many practical situations in which we would like to identify the parameters of a
signal, but we have only corrupted observations of the signal available. On the other hand,
there  exist  situations  in  which  we  would  like  to  estimate  a  signal  in  noise,  but  we  have 
only  partial  knowledge  of  the  parametric  measurement  model.  Both  types  of  situations 
call  for  the  solution  of  a  joint  problem  involving  aspects  of  both  source  identification  and 
signal  estimation;  however,  a  joint  problem  of  this  kind  may  be  too  complex  to  serve  as 
a  starting  point  for  the  development  of  a  new  inference  framework.  Consideration  of  the 
two  simplified  problems  described  above  will  allow  us  to  explore  a  number  of  important 
issues  in  non-Gaussian  signal  processing  and  still  provide  a  suitable  foundation  for  future 
investigations. 

We remark further that the stationarity assumption on both the signal and noise has
been introduced mainly to simplify later discussions and analysis. Although this assumption
may seem somewhat restrictive, in practice it usually does not pose a serious difficulty,
for many real-world signals can be considered stationary as a good working approximation.
In cases where we cannot legitimately regard the signal as being stationary over the entire
observation interval, we can often decompose the observation interval into a series of subintervals,
and then treat the portion of the signal in each subinterval as being stationary.
This strategy is commonly used, for example, in the analysis and processing of speech signals,
whose statistical characteristics change dramatically over time but remain reasonably
stable over brief intervals [153, 47]. In other cases where stationarity does not hold for the
entire  observation  interval,  it  may  be  possible  to  transform  the  signal  in  some  way  so  as  to 
induce  approximately  stationary  behavior.  This  technique  is  often  used  for  certain  financial 
or  economic  time  series,  which  may  be  non-stationary  only  because  they  contain  simple 
deterministic  growth  or  seasonal  trends  over  time  [70,  128,  196]. 

1.3  Traditional  Approach  to  the  Inference  Problems 

1.3.1  The  Linear-Gaussian  Measurement  Model 

In  order  to  develop  feasible,  working  solutions  to  the  inference  problems  described  above, 
it  is  traditionally  assumed  that  the  stochastic  structure  of  the  source  signal,  as  well  as 
that  of  any  corrupting  noise  present  when  the  signal  is  observed,  is  adequately  described 
by a Gaussian probability density function (pdf). This assumption is often expressed
from a systems point of view: the source in Figure 1-1 is defined to
be a stable, linear, time-invariant (LTI) system driven by zero-mean, unit-variance white
Gaussian noise. The output of the source, {Y_t}, then possesses a Gaussian pdf whose specific
form  is  determined  entirely  by  the  impulse  response  of  the  LTI  system.  The  corrupting  noise, 
{Vt},  is  often  assumed  to  be  zero-mean  white  Gaussian  noise  having  some  fixed  power  level. 
These assumptions placed on the signal and noise are typically referred to as the linear-Gaussian
model.

1.3.2  Advantages  of  the  Classical  Model 

There  are  several  reasons  why  the  classical  model  described  above  has  enjoyed  immense 
popularity in the past. Clearly, the linear-Gaussian assumption is often invoked for the
sake of mathematical convenience, since it leads to the tractable derivation and analysis of
theoretically  optimal  signal  processing  algorithms.  But  the  model  has  also  been  successfully 
applied  in  a  wide  range  of  practical  problems  —  including  applications  in  signal  analysis, 
filtering,  prediction,  and  control  —  over  a  span  of  many  decades.  Its  continued  use  and 
satisfactory  performance  in  such  diverse  applications  clearly  validate  the  classical  model 
as  a  good  first-order  approximation  of  many  real-world  signals  and  systems.  Indeed,  in 
certain  situations,  compelling  physical  arguments  can  be  made  that  justify  the  use  of  the 
linear-Gaussian  assumption  via  the  Central  Limit  Theorem  [125]. 

The  body  of  literature  that  has  evolved  around  the  linear-Gaussian  model  has  become 
quite  rich  and  extensive;  as  a  result,  the  mathematical  theory  associated  with  the  model 
is now fully developed and well understood. In addition, a number of elegant and powerful
algorithms have been developed in conjunction with the model. These include, for
example, the methods of Levinson [108], Durbin [51], and Burg [35], which are essentially
solutions to the source identification problem under the assumption that the LTI system in
the model is purely autoregressive, as well as the widely used techniques of Wiener [228]
and Kalman [88, 89], which are optimal solutions to the signal estimation problem under
somewhat more general model assumptions. These and other algorithms developed for the
linear-Gaussian model are, in general, computationally efficient, easy to implement, and
fairly  straightforward  to  analyze. 

1.3.3  Limitations  of  the  Classical  Model 

In  spite  of  its  many  desirable  properties,  the  linear-Gaussian  model  also  has  a  number  of 
limitations  which  cast  doubt  on  its  appropriateness  in  certain  signal  processing  problems. 
Invoking  the  traditional  Gaussian  assumption  actually  imposes  rather  stringent  structural 
constraints  on  the  waveforms  being  modeled.  For  example,  the  Gaussian  pdf  is  inherently  a 
symmetric function; hence, any non-Gaussian signal whose pdf exhibits pronounced asymmetry
is not likely to be adequately described by the classical model. In addition, we note
that the pdf of any zero-mean, stationary Gaussian signal is completely characterized by
the set of second-order signal moments (or, equivalently, by the autocorrelation function);
in contrast, in order to specify the pdf of a non-Gaussian signal, higher-order moments (possibly
an infinite number of them) are required. Furthermore, if we combine this property
of  sufficiency  of  second-order  moments  with  the  fact  that  the  autocorrelation  function  is 
symmetric,  we  find  that  the  pdf  of  any  stationary  Gaussian  signal  is  invariant  with  respect 
to  a  time-reversal  of  the  signal;  on  the  other  hand,  the  pdf  of  a  stationary  non-Gaussian 
signal  is  typically  quite  sensitive  to  the  orientation  of  the  time  axis  [201]. 

Yet  another  major  limitation  of  the  linear-Gaussian  model  is  that  it  is  not  suitable  for 
representing  signals  that  exhibit  sudden  high-amplitude  bursts  or  sporadic  outliers.  Such 
signals  constitute  a  broad  and  important  class  of  non-Gaussian  phenomena  that  arise  in 
practical settings; they are encountered in applications such as underwater acoustical analysis
and signal detection [29, 52, 113, 230], low-frequency and other modes of communication
[26, 107, 124, 126, 174], and exploration seismology [219], to name just a few. Signals
and noise that exhibit impulsive behavior cannot be accurately represented by a linear-Gaussian
model because the tails of a Gaussian pdf decay extremely rapidly, and they
therefore cannot accommodate high-amplitude events. For this reason, as well as those
cited  earlier,  the  classical  model  may  lack  the  flexibility  needed  to  provide  an  accurate  fit 
to  certain  waveforms  encountered  in  real-world  problems. 
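
To make the tail argument concrete, the following sketch compares the probability of exceeding four standard deviations under a Gaussian pdf with the same probability under a zero-mean, two-component Gaussian mixture having the same overall variance. The mixture weights and standard deviations are illustrative assumptions chosen for this example; they are not values taken from the thesis.

```python
import math

def gaussian_tail(x, sigma):
    """P(|W| > x) for a zero-mean Gaussian with standard deviation sigma."""
    return math.erfc(x / (sigma * math.sqrt(2.0)))

def mixture_tail(x, weights, sigmas):
    """P(|W| > x) for a zero-mean Gaussian mixture (weighted sum of component tails)."""
    return sum(w * gaussian_tail(x, s) for w, s in zip(weights, sigmas))

# Hypothetical mixture: a nominal component plus a rare high-variance "burst" component.
weights = [0.9, 0.1]
sigmas = [1.0, 5.0]

# For a zero-mean mixture, the overall variance is the weighted average
# of the component variances.
total_sigma = math.sqrt(sum(w * s * s for w, s in zip(weights, sigmas)))

# Probability of an excursion beyond four standard deviations under each model.
threshold = 4.0 * total_sigma
p_gauss = gaussian_tail(threshold, total_sigma)
p_mix = mixture_tail(threshold, weights, sigmas)
print(f"Gaussian: {p_gauss:.2e}   mixture: {p_mix:.2e}")
```

Under these assumed values, the two models agree in their second-order description, yet the mixture assigns far more probability to large excursions, which is the behavior a purely Gaussian pdf cannot reproduce.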


1.4  The  Need  for  Non-Gaussian  Signal  Models 

In  practice,  we  rarely  have  perfect  knowledge  of  the  stochastic  structure  of  either  the  signal 
or  noise;  hence,  in  most  cases  where  the  linear-Gaussian  model  is  used,  it  is  intended  only 
as  a  nominal  approximation.  We  have  already  mentioned  several  applications,  however,  in 
which  either  the  signal  pdf  or  the  noise  pdf  deviates  considerably  from  this  nominal  Gaussian 
assumption.  A  number  of  additional  applications  that  are  known  to  involve  non-Gaussian 
phenomena  can  be  found  in  [91,  225,  226].  In  such  applications,  a  critical  question  that 
must  be  addressed  is  whether  a  moderate  amount  of  mismatch  between  the  actual  signal 
and  the  nominal  signal  model  will  lead  to  only  a  moderate  amount  of  degradation  in  overall 
signal  processing  performance. 

A  commonly  cited  example  demonstrating  the  potential  loss  in  performance  due  to  model 
mismatch  involves  the  coherent  reception  of  a  deterministic  waveform  in  additive  white 
Gaussian  noise.  It  is  well  known  that  the  best  possible  detector  for  this  problem  (in  the 
sense  that  it  minimizes  the  probability  of  decision  error)  is  the  matched  filter,  which  consists 
of  a  cross-correlation  of  the  observation  with  the  known  waveform  and  then  a  comparison  of 
the  result  to  a  fixed  threshold  [76].  Though  it  is  optimal  when  the  noise  is  truly  Gaussian, 
the  matched  filter  may  suffer  a  dramatic  decline  in  performance  if  the  noise  pdf  deviates  even 
slightly from the nominal Gaussian form [80, 81, 169, 173]. On the other hand, incorporating
a  modest  amount  of  nonlinear  signal  processing  based  on  a  more  realistic  noise  model  can 
yield  a  detector  that  is  far  superior  to  the  matched  filter  [127,  107,  145,  120]. 
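
The matched filter described above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, and the default threshold assumes a binary test between "noise only" and "waveform plus noise" with equal prior probabilities, for which the minimum-error threshold in white Gaussian noise is half the waveform energy.

```python
def matched_filter_statistic(observation, waveform):
    """Cross-correlate the observation with the known waveform at zero lag."""
    return sum(z * s for z, s in zip(observation, waveform))

def matched_filter_detect(observation, waveform, threshold=None):
    """Return True if the waveform is declared present.

    Assumption for this sketch: equally likely hypotheses 'noise only' and
    'waveform plus noise', so the default threshold is half the waveform energy.
    """
    if threshold is None:
        threshold = 0.5 * sum(s * s for s in waveform)
    return matched_filter_statistic(observation, waveform) > threshold

# Hypothetical known waveform and two noiseless test observations.
waveform = [1.0, -1.0, 1.0, 1.0]
print(matched_filter_detect(waveform, waveform))    # waveform present
print(matched_filter_detect([0.0] * 4, waveform))   # waveform absent
```

Because the statistic is a linear function of the observation, a single large noise sample can push it across the threshold, which is one way to see why impulsive noise degrades this detector.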

Other  examples  have  been  presented  in  the  literature  that  demonstrate  a  similar  lack 
of  robustness  with  the  linear-Gaussian  model  in  the  problems  of  source  identification  and 
signal estimation [90, 117, 118, 119, 122, 146]. Such examples underscore the need for more
accurate  (and,  unavoidably,  more  complex)  signal  models  in  situations  where  a  severe  loss 
in  performance  cannot  be  tolerated. 


1.5  Proposed  Approaches  to  the  Inference  Problems 

1.5.1  Developing  Extensions  to  the  Classical  Model 

In  order  to  overcome  the  limitations  with  the  linear-Gaussian  model,  we  seek  to  develop 
more  realistic  models  that  will  allow  us  to  solve  the  problems  of  source  identification  and 
signal  estimation  when  they  involve  non-Gaussian  signals.  We  observe,  however,  that  the 
class  of  non-Gaussian  signals  is  immense  and  extremely  diverse,  even  when  it  is  restricted  to 
include only those signals that are stationary. Indeed, to define a signal to be non-Gaussian
is  to  characterize  it  by  default,  i.e.,  by  its  failure  to  possess  a  specific,  well  defined  statistical 
property.  For  this  reason,  it  is  virtually  impossible  to  develop  a  general,  unifying  framework 
that applies equally well to all signals in the class.
Therefore, our initial approach to creating a non-Gaussian inference framework will
be  to  focus  on  a  narrow,  well  defined  class  of  signals  that  is  often  considered  under  the 
Gaussian  assumption,  namely  the  class  of  linear  autoregressive  (AR)  signals.  We  make  a 
slight  mathematical  modification  to  this  signal  class  so  that  it  includes  non-Gaussian  as 
well  as  Gaussian  processes,  and  we  then  attempt  to  develop  solutions  to  our  two  inference 
problems  with  this  modified  model.  We  describe  approaches  based  on  this  new  model  in 
more  detail  in  the  following  subsection;  we  then  describe  a  second,  decidedly  different  signal 
model  that  has  been  designed  to  compensate  for  certain  computational  disadvantages  of  the 
initial  model. 

1.5.2  The  ARGMIX  Signal  Model 

The  first  model  we  will  consider  is  intended  to  be  a  direct  generalization  of  the  classical  AR 
linear-Gaussian model. Under the ARGMIX model, the source signal {Y_t} is characterized
as the output of an AR linear time-invariant (LTI) system driven by a white non-Gaussian
noise process of a special type. More specifically, {Y_t} is assumed to obey the Kth-order
difference equation

\[ Y_t = \sum_{k=1}^{K} a_k Y_{t-k} + W_t, \tag{1.1} \]

where {a_k}, k = 1, ..., K, are the real-valued AR coefficients of the process and {W_t} is a sequence
whose elements are independent and identically distributed (i.i.d.) according to a Gaussian-mixture
(GMIX) pdf, i.e., a pdf that is a weighted average of a finite number of Gaussian
densities having arbitrary means and variances. We refer to this representation for the
source signal as the ARGMIX model.
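
As an illustration of the difference equation (1.1), the following sketch generates a realization of an ARGMIX signal. The function names, the burn-in length used to reduce the effect of the zero initial condition, and the parameter values in the usage lines are assumptions made for this example only.

```python
import random

def sample_gmix(rng, weights, means, sigmas):
    """Draw one sample from a Gaussian-mixture pdf: pick a component, then sample it."""
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[i], sigmas[i])

def simulate_argmix(n, ar_coeffs, weights, means, sigmas, seed=0, burn_in=200):
    """Generate n samples of Y_t = sum_k a_k Y_{t-k} + W_t with i.i.d. GMIX noise W_t."""
    rng = random.Random(seed)
    # Past outputs, most recent first: Y_{t-1}, ..., Y_{t-K} (zero initial condition).
    history = [0.0] * len(ar_coeffs)
    out = []
    for t in range(n + burn_in):
        w = sample_gmix(rng, weights, means, sigmas)
        y = sum(a * y_past for a, y_past in zip(ar_coeffs, history)) + w
        history = [y] + history[:-1]
        if t >= burn_in:  # discard the start-up transient
            out.append(y)
    return out

# Hypothetical second-order example with occasional high-variance noise samples.
y = simulate_argmix(500, ar_coeffs=[0.5, -0.3],
                    weights=[0.9, 0.1], means=[0.0, 0.0], sigmas=[1.0, 5.0])
print(len(y), max(abs(v) for v in y))
```

The AR coefficients should describe a stable system for the output to settle into stationary behavior; the sketch does not check this.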

To solve the source identification problem for the ARGMIX model, recall that we must
generate an estimate for the signal parameter vector θ, which in this case consists of not
only  the  AR  parameters,  but  also  the  mixture  parameters  (i.e.,  the  means,  variances,  and 
weighting  coefficients  that  define  the  Gaussian-mixture  pdf).  Maximum  likelihood  (ML) 
estimates  for  this  problem  have  not  been  directly  pursued  in  the  past  because  the  likelihood 
function  is  unbounded  in  the  vicinity  of  certain  known,  degenerate  parameter  values.  In 
general,  these  degenerate  values  are  not  useful  as  estimates,  even  though,  strictly  speaking, 
they  do  maximize  the  likelihood  function. 
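
The unboundedness of the likelihood can be seen from a short argument. The notation below is introduced only for this illustration: \rho_i, \mu_i, and \sigma_i^2 denote the weight, mean, and variance of the i-th of M >= 2 mixture components, and the likelihood is taken to be the product of the noise pdf evaluated at the residuals implied by (1.1), conditioned on the first K samples.

```latex
% Residuals implied by (1.1) and the assumed Gaussian-mixture pdf of W_t:
w_t = y_t - \sum_{k=1}^{K} a_k y_{t-k}, \qquad
f_W(w) = \sum_{i=1}^{M} \rho_i \, \frac{1}{\sqrt{2\pi\sigma_i^2}}
         \exp\!\left( -\frac{(w-\mu_i)^2}{2\sigma_i^2} \right).

% Choose any residual w_{t_0}, set \mu_1 = w_{t_0}, and hold a second
% component fixed with \rho_2 > 0 and \sigma_2 > 0. Then
f_W(w_{t_0}) \;\ge\; \frac{\rho_1}{\sqrt{2\pi}\,\sigma_1} \;\to\; \infty
\quad \text{as } \sigma_1 \to 0,
\qquad
f_W(w_t) \;\ge\; \frac{\rho_2}{\sqrt{2\pi\sigma_2^2}}
         \exp\!\left( -\frac{(w_t-\mu_2)^2}{2\sigma_2^2} \right) > 0
\quad \text{for all } t,

% so the product \prod_t f_W(w_t) grows without bound as \sigma_1 \to 0.
```

The maximizing parameter values in this construction place a zero-variance component on a single data point, which explains why they carry no useful information about the source.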

As  we  will  see  in  Chapter  2,  however,  strategies  based  on  finding  non-degenerate  local 
maxima of the likelihood function yield solutions that are useful. Indeed, Titterington et
al. [200] showed that the approach of locally maximizing the likelihood function is useful
for  the  related  problem  of  estimating  the  mixture  parameters  only,  i.e.,  the  problem  in 
which  the  LTI  system  is  known  to  be  an  identity  system.  These  researchers,  as  well  as 
others  [54,  121],  have  empirically  studied  the  performance  of  several  numerical  hill-climbing 
algorithms  for  computing  ML  estimates  of  the  mixture  parameters  and  have  found  that 
these  algorithms  often  produce  reasonable  results.  To  solve  the  more  complex  identification 
problem  in  which  all  of  the  ARGMIX  parameters  must  be  estimated  jointly,  we  show  that 
an efficient iterative algorithm can be constructed based on the expectation-maximization
(EM) principle [48]. We refer to this iterative technique as the EMAX algorithm.
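
For orientation, the sketch below shows the standard EM iteration for the simpler related problem mentioned above, in which the LTI system is an identity system and only the mixture parameters are estimated from i.i.d. samples. It is not the EMAX algorithm of Chapter 2, which also estimates the AR parameters. The function names, the variance floor used to steer away from the degenerate zero-variance solutions, and the synthetic data are assumptions made for this example.

```python
import math
import random

def gauss_pdf(x, mean, var):
    """Gaussian pdf with the given mean and variance, evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def em_gmix(data, weights, means, variances, n_iter=50, var_floor=1e-6):
    """Standard EM iterations for a 1-D Gaussian mixture fitted to i.i.d. samples."""
    weights, means, variances = list(weights), list(means), list(variances)
    m = len(weights)
    for _ in range(n_iter):
        # E-step: posterior probability that each sample came from each component.
        resp = []
        for x in data:
            joint = [weights[i] * gauss_pdf(x, means[i], variances[i]) for i in range(m)]
            total = max(sum(joint), 1e-300)  # guard against numerical underflow
            resp.append([j / total for j in joint])
        # M-step: re-estimate each component from its responsibility-weighted samples.
        for i in range(m):
            n_i = max(sum(r[i] for r in resp), 1e-300)
            weights[i] = n_i / len(data)
            means[i] = sum(r[i] * x for r, x in zip(resp, data)) / n_i
            var = sum(r[i] * (x - means[i]) ** 2 for r, x in zip(resp, data)) / n_i
            variances[i] = max(var, var_floor)  # assumed floor; avoids collapse to zero
    return weights, means, variances

# Synthetic data from a hypothetical two-component mixture.
rng = random.Random(0)
data = [rng.gauss(-3.0, 1.0) for _ in range(300)] + [rng.gauss(3.0, 1.0) for _ in range(300)]
w_hat, mu_hat, var_hat = em_gmix(data, [0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
print(w_hat, mu_hat, var_hat)
```

Each iteration does not decrease the likelihood, so the procedure climbs to a local maximum that depends on the initial parameter values.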

Although  we  are  able  to  make  considerable  progress  in  ARGMIX  source  identification, 
unfortunately the problem of optimally estimating an ARGMIX signal in independent additive
noise appears to be computationally infeasible. In the latter part of Chapter 2, we
demonstrate,  using  the  simple  example  in  which  an  ARGMIX  signal  is  corrupted  by  white 
Gaussian  noise,  that  generating  an  MMSE  estimate  requires  a  number  of  computations 
growing  exponentially  with  the  number  of  samples  contained  in  the  observation.  For  the 
signal  estimation  problem,  therefore,  we  propose  only  approximate,  suboptimal  techniques. 

1.5.3  The  HMM-Based  Signal  Model

In Chapters 3, 4, and 5, we introduce and develop a second signal model as a way of overcoming
the computational difficulties encountered with the ARGMIX structure. This new
model is fundamentally different from the one considered above; it is intended to represent
a given stationary AR signal only approximately as the output of a finite-state hidden
Markov  model  (HMM).  A  representation  of  this  type  can  be  constructed  by  quantizing  the 
underlying  dynamics  of  the  actual  signal.  To  carry  out  the  construction,  we  first  partition 
the  state  space  associated  with  the  original  signal  into  several  disjoint  regions  and  assign 
each  region  to  a  unique  state  of  the  Markov  chain  in  the  approximating  HMM.  We  then 
specify  a  set  of  appropriate  initial  state  probabilities  and  state  transition  probabilities  for 
this  Markov  chain.  After  specifying  the  finite-state  representation  of  the  signal  dynamics  in 
this  way,  we  then  complete  the  overall  approximation  by  assigning  an  appropriate  output 
pdf  to  each  state  of  the  HMM. 
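
The construction steps above can be sketched for the simplest case of a first-order signal, whose state space is the real line. This is an empirical illustration under stated assumptions, not the analytical method of Chapter 3: the regions are intervals defined by a hypothetical list of boundaries, the initial and transition probabilities are estimated by counting over a realization, and each output pdf is taken to be a Gaussian fitted to the samples falling in its region.

```python
import bisect

def region_index(y, boundaries):
    """Index of the interval (HMM state) that contains y, for sorted boundaries."""
    return bisect.bisect_right(boundaries, y)

def fit_quantized_hmm(signal, boundaries):
    """Build a finite-state approximation from a realization of a first-order signal."""
    n_states = len(boundaries) + 1
    states = [region_index(y, boundaries) for y in signal]

    # Initial state probabilities: relative frequency of each region.
    initial = [states.count(s) / len(states) for s in range(n_states)]

    # State transition probabilities: normalized counts of consecutive region pairs.
    counts = [[0] * n_states for _ in range(n_states)]
    for s, s_next in zip(states, states[1:]):
        counts[s][s_next] += 1
    transitions = []
    for row in counts:
        total = sum(row)
        # Unvisited states get a uniform row so that every row is a valid distribution.
        transitions.append([c / total if total else 1.0 / n_states for c in row])

    # Output pdf per state: (mean, variance) of a Gaussian fitted to the region's samples.
    outputs = []
    for s in range(n_states):
        members = [y for y, st in zip(signal, states) if st == s]
        if members:
            mean = sum(members) / len(members)
            var = sum((y - mean) ** 2 for y in members) / len(members)
        else:
            mean, var = 0.0, 1.0  # placeholder for an empty region
        outputs.append((mean, var))
    return initial, transitions, outputs

# Tiny hypothetical realization and partition, for illustration only.
signal = [-2.0, -0.5, 0.5, 2.0, -2.0, -0.5]
initial, transitions, outputs = fit_quantized_hmm(signal, boundaries=[-1.0, 0.0, 1.0])
print(initial)
print(transitions)
```

For a Kth-order signal the regions would partition a K-dimensional state space, and the choice of region boundaries becomes the central design question.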

This  new  signal  model  allows  us  to  develop  computationally  efficient  algorithms  for 
inference  problems  in  which  the  source  signal  {Yt}  is  described  by  the  more  general  nonlinear 
difference  equation 


\[ Y_t = h(Y_{t-1}, \ldots, Y_{t-K}, W_t), \tag{1.2} \]

rather  than  the  traditional  linear  form  given  in  (1.1).  However,  before  the  HMM-based 
model  can  be  used  to  perform  either  source  identification  or  signal  estimation,  we  must  first 
address  the  basic  issue  of  random  process  approximation,  i.e.,  we  must  determine  how  to 
best  represent  the  true  signal  by  an  HMM  when  the  pdf  of  the  signal  is  precisely  known. 
This  problem,  which  we  discuss  in  Chapter  3,  can  be  solved  by  minimizing  a  properly  chosen 
distance  measure  between  the  approximate  and  actual  densities.  The  solution  provides  us 
with  a  number  of  theoretical  criteria  that  must  be  satisfied  by  the  components  of  the  optimal 
HMM-based  approximation. 

In  Chapter  4,  we  use  the  theoretical  criteria  derived  in  the  signal  approximation  problem 
as  guidelines  for  developing  a  practical  source  identification  algorithm.  This  algorithm  is 
designed  to  iteratively  adjust  the  region  boundaries  of  the  state-space  partition  to  find  the 
best  HMM-based  approximation,  using  only  a  finite-length  realization  of  the  true  signal. 
The  algorithm  therefore  allows  us  to  obtain  working  HMM-based  models  of  arbitrarily 
complicated  AR  signals,  which  can  then  be  applied  to  the  problem  of  signal  estimation. 
Techniques for performing signal estimation based on the HMM paradigm are described
in Chapter 5. The basic computational engine for these techniques is based on related
existing  methods  that  have  been  developed  for  automatic  speech  recognition,  where  HMMs 
are  now  widely  used  [78,  85].  Building  on  this  previous  research,  we  create  a  powerful  new 
technique  for  dealing  with  independent  additive  noise  whose  samples  may  in  general  be 
both  non-Gaussian  and  temporally  dependent. 


1.6  Prior  Work  on  Non-Gaussian  Inference  Problems 

1.6.1  Non-Gaussian  Source  Identification 

The  most  popular  methods  for  estimating  parameters  of  non-Gaussian  processes  have  been 
based on higher-order statistics (HOS) (see, for example, [123, 133, 134, 135], and associated
references).  Most  techniques  of  this  kind  have  been  based  on  a  signal  model  that  is  similar 
to  the  ARGMIX  model,  in  that  the  observed  process  is  assumed  to  be  the  output  of  an  LTI 
system  driven  by  white  non-Gaussian  noise.  Early  versions  of  the  methods  currently  used 
were  first  proposed  by  Giannakis  [64],  and  were  further  analyzed  and  extended  by  Giannakis 
and  Mendel  [65],  Porat  and  Friedlander  [148],  and  Tugnait  [209,  210,  211].  These  methods 
are  generally  robust  in  the  presence  of  observation  noise,  are  fairly  easy  to  implement,  and 
make  few  assumptions  about  the  pdf  of  the  AR  process.  However,  according  to  Mendel  [123], 
because  they  extract  much  of  their  information  about  the  observed  process  by  computing 
sample  moments  or  cumulants  above  second  order,  HOS-based  methods  tend  to  produce 
high-variance  parameter  estimates,  particularly  when  the  length  of  the  data  record  is  small. 

The approach based on the ARGMIX model is fundamentally different from the HOS-based
approach in that it assumes a specific form for the pdf of the observed data, and is
therefore entirely parametric. The Gaussian-mixture model is capable of closely approximating
many densities, and has been considered by a number of researchers for this purpose
(see, for example, [50, 200, 156, 54, 121]). Yet apparently only a few researchers, most notably
Sengupta and Kay [171] and Zhao et al. [232], have previously considered Gaussian-mixture
models in conjunction with AR systems. Sengupta and Kay [171] have addressed the problem
of ML estimation of AR parameters for ARGMIX processes in which only two Gaussian
pdfs  constitute  the  mixture,  each  with  zero  mean  and  known  variance,  but  with  unknown 
relative  weighting.  They  used  a  conventional  Newton-Raphson  optimization  algorithm  that 
is  initialized  by  the  least-squares  solution  to  find  ML  estimates  for  the  AR  parameters  and 
for  the  single  weighting  coefficient,  and  showed  that  the  performance  of  the  ML  estimate  is 
superior to that of the standard forward-backward least-squares method. However, a potentially
serious limitation of their algorithm is that it does not always converge. Moreover,
because  they  have  examined  such  a  highly  constrained  version  of  the  ARGMIX  model,  it  is 
unclear  whether  the  algorithm  can  be  easily  generalized. 

In a separate investigation, Zhao et al. [232] also considered ML estimation of the AR
parameters of ARGMIX processes and derived a set of linear equations whose solution gives
the ML estimate for the AR parameters when all the mixture parameters are known. When
the mixture parameters are unknown, they combine these linear equations with a clever ad
hoc clustering technique to produce an iterative algorithm for obtaining a joint estimate of
both the AR parameters and the mixture parameters. They do not guarantee convergence
of  this  algorithm  or  optimality  of  the  estimate  in  any  sense,  but  they  have  demonstrated 
empirically  that  the  performance  of  their  algorithm  is  superior  to  that  of  cumulant-based 
methods  in  certain  cases.  The  primary  limitation  of  their  algorithm  is  that  it  cannot  produce 
unbiased  estimates  of  the  means  of  the  Gaussian  densities  in  the  mixture  whenever  two  or 
more  of  the  true  means  coincide.  This  unavoidable  bias  in  turn  degrades  the  AR  parameter 
estimates. 


1.6.2  Non-Gaussian  Signal  Estimation 

Generating an MMSE estimate of a non-Gaussian signal in additive noise requires, in general,
a processing scheme that is nonlinear; hence, methods that have been developed for
the  non-Gaussian  signal  estimation  problem  are  commonly  referred  to  as  nonlinear  filtering 
techniques.  Most  of  these  techniques  are  based  on  a  state-space  measurement  model  having 
the  form 


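\[ \mathbf{X}_t = H(\mathbf{X}_{t-1}, W_t) \tag{1.3} \]
\[ Y_t = G(\mathbf{X}_t) \tag{1.4} \]
\[ Z_t = Y_t + V_t, \tag{1.5} \]

where, in the $K$th-order AR case, the state vector $\mathbf{X}_t$ is defined (without loss of generality)
as $\mathbf{X}_t = (Y_t, Y_{t-1}, \ldots, Y_{t-K})$ and $G(\cdot)$ is a function that merely returns the first element of its
vector argument. When this model is used, the recursion that characterizes the evolution of
the density of the state vector $\mathbf{X}_t$ based on the vector of observations $\mathbf{Z}_{0:t} = (Z_0, Z_1, \ldots, Z_t)$
is given by [189]

\[ f(\mathbf{x}_t \mid \mathbf{z}_{0:t}) = c \cdot f(z_t \mid \mathbf{x}_t)\, f(\mathbf{x}_t \mid \mathbf{z}_{0:t-1}) \tag{1.6} \]
\[ f(\mathbf{x}_{t+1} \mid \mathbf{z}_{0:t}) = \int f(\mathbf{x}_{t+1} \mid \mathbf{x}_t)\, f(\mathbf{x}_t \mid \mathbf{z}_{0:t})\, d\mathbf{x}_t, \tag{1.7} \]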

where  c  is  a  normalizing  constant.  These  two  equations  are  often  termed  the  measurement 
update  and  time  update  formulas,  respectively.  Unfortunately  the  recursion  defined  by 
these  equations  cannot  usually  be  solved  in  closed  form.  Thus,  most  nonlinear  filtering 
techniques described in the literature are practical, ad hoc methods for computing approximate
solutions to (1.6) and (1.7).

By  far,  the  most  popular  approach  to  the  nonlinear  filtering  problem  has  been  the 
extended  Kalman  filter,  or  EKF  [14,  181].  With  this  technique,  a  linear  Taylor  series 
expansion  of  the  system  dynamics  is  calculated  in  the  vicinity  of  the  current  state  vector 
estimate,  and  then  the  usual  linear  Kalman  filtering  formulas  are  applied.  Variations  on 
the basic EKF technique have been proposed by Wishner et al. [229] and by Gelb [63].
Further  enhancements  of  the  technique  can  be  obtained  by  incorporating  the  second-order 
terms  from  the  original  Taylor  series  expansion  [16,  83,  182].  It  has  been  demonstrated 
empirically  that  the  EKF  and  its  variants  often  yield  satisfactory  estimation  performance; 
however, for many cases in which the signal-to-noise ratio is only moderate or low, or in
which the densities of certain model variables are not adequately characterized by their
low-order moments, the EKF is known to diverge [188].


A wide range of alternative methods for solving (1.6) and (1.7), which are quite different
from the EKF, have also appeared in the literature. Many of these techniques are
based  not  on  a  Taylor  series  approximation  of  the  system  dynamics,  but  rather  on  a  direct 
approximation  of  the  posterior  state  density  using  a  discrete  grid  of  points  in  state  space. 
Among  these  alternative  techniques  are  the  point-mass  approach  [32,  34,  39],  in  which  the 
posterior  state  density  is  approximated  by  a  probability  mass  function  defined  on  the  grid; 
the  Gaussian-sum  approach  [184,  11],  in  which  the  state  density  is  represented  by  a  weighted 
combination  of  purely  Gaussian  densities  (each  centered  at  a  different  point  on  the  grid); 
and  the  spline-based  approach  [46,  99,  102,  221,  222],  in  which  the  density  is  approximated 
using  polynomial  segments,  and  the  grid  points  themselves  serve  as  knots  for  the  spline. 


All  of  these  grid-based  approaches  are  implemented  using  the  same  basic  sequence  of 
processing  stages.  First,  an  initial  set  of  grid  points  is  defined  in  such  a  way  that  the  region 
encompassed  by  the  grid  accounts  for  nearly  all  of  the  true  posterior  probability  mass.  Then, 
at  each  new  time  index,  the  values  associated  with  these  grid  points  are  updated  using  the 
Bayesian  formulas  (1.6)  and  (1.7)  and,  simultaneously,  the  locations  of  the  grid  points  are 
adjusted  so  that  the  grid  once  again  encompasses  most  of  the  true  probability  mass.  For 
most  of  the  grid-based  approaches,  the  approximate  Bayesian  updates  are  performed  using 
numerical  integration;  however,  other  methods  of  updating  have  been  suggested  which  use 
random  sampling  of  the  densities  involved  [38,  68,  192,  194].  In  any  case,  the  values  as  well 
as  the  locations  of  the  grid  points  are  continually  modified  over  time  so  that  an  adequate 
representation  of  the  actual  state  pdf  is  maintained. 
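
As a rough illustration of the processing stages just described, the following Python sketch implements a point-mass filter for a deliberately simplified case. It is not taken from any of the cited works: it assumes a first-order AR state driven by Gaussian-mixture noise, Gaussian observation noise, and a fixed, uniformly spaced grid (the grid-relocation stage is omitted). The names `gauss`, `gmix`, and `point_mass_filter` are hypothetical.

```python
import numpy as np

def gauss(w, mu, sigma):
    """Gaussian pdf with mean mu and standard deviation sigma."""
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def gmix(w, rho, mu, sigma):
    """Gaussian-mixture pdf evaluated elementwise on the array w."""
    return sum(r * gauss(w, m, s) for r, m, s in zip(rho, mu, sigma))

def point_mass_filter(z, grid, a, rho, mu, sigma, sigma_v):
    """Approximate the conditional mean of x_t given z_0..z_t on a fixed grid.

    Assumed model (illustration only): x_t = a * x_{t-1} + w_t with w_t drawn
    from a Gaussian mixture, and z_t = x_t + v_t with v_t ~ N(0, sigma_v).
    """
    dx = grid[1] - grid[0]
    # trans[i, j] = f(x_{t+1} = grid[i] | x_t = grid[j]); J x J table for J grid points.
    trans = gmix(grid[:, None] - a * grid[None, :], rho, mu, sigma)
    prior = np.full(grid.size, 1.0 / (grid.size * dx))   # flat initial density
    estimates = []
    for zt in z:
        post = gauss(zt, grid, sigma_v) * prior          # measurement update, cf. (1.6)
        post /= post.sum() * dx                          # normalizing constant c
        estimates.append(np.sum(grid * post) * dx)       # conditional-mean estimate
        prior = trans @ post * dx                        # time update, cf. (1.7)
    return np.array(estimates)
```

The J-by-J transition table in this sketch is where the quadratic growth in cost with the number of grid points shows up; for a K-dimensional state the grid, and hence the table, grows much faster.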


Although  reasonably  good  performance  can  be  obtained  using  grid-based  approaches 
in  many  practical  problems,  there  are  several  undesirable  properties  associated  with  the 
methods  that  have  been  developed  to  date.  A  major  drawback  is  that  these  methods 
are,  in  general,  very  computationally  expensive.  Much  of  the  computation  is  spent  on 
numerically evaluating the multidimensional integral in (1.7); specifically, if the grid contains
$J$ points, then this numerical integration requires $O(J^2)$ evaluations of the measurement
density. Another weakness of grid-based approaches is the lack of an optimality principle to
guide  the  assignment  of  the  grid  parameters.  Although  many  researchers  have  pointed  out 
the  flexibility  of  grid-based  methods,  none  have  apparently  formulated  the  grid  selection 
procedure  as  an  optimization  problem;  instead,  they  provide  only  coarse  rules  of  thumb 
indicating,  for  example,  how  the  grid  points  should  be  arranged  geometrically  in  state  space 
at  each  time.  In  certain  cases,  no  rules  are  provided  at  all;  rather,  only  the  possibility  for 
redefining  the  grid  in  some  useful  manner  (e.g.,  adding  or  removing  particular  grid  points, 
or  changing  the  spacing  between  existing  grid  points)  is  suggested.  Because  the  grid  itself 
is  never  optimized  in  any  way,  there  exists  the  potential  for  wasted  computation  or  for  the 
accrual  of  unnecessarily  large  errors  in  the  density  approximation  as  the  grid  evolves  over 
time.


1.7 Thesis Overview and Outline

The  thesis  consists  of  a  total  of  six  chapters,  including  this  introductory  chapter.  The 
chapters  that  make  up  the  core  of  the  technical  material,  namely  Chapters  2  through  5,  fall 
naturally  into  two  main  parts.  The  first  part,  which  consists  solely  of  Chapter  2,  examines 
inference  problems  involving  the  ARGMIX  signal  model;  the  second  part,  which  consists 
of  Chapters  3  through  5,  develops  the  theory  and  algorithms  for  the  HMM-based  signal 
model.  Below  we  give  a  brief  description  of  the  material  contained  in  each  of  the  remaining 
chapters. 

In  Chapter  2,  we  focus  exclusively  on  source  identification  and  signal  estimation  when 
the  source  signal  is  described  by  the  ARGMIX  model.  The  emphasis  is  placed  heavily  on 
source identification, since this is the more tractable of the two problems. We develop a general
iterative  algorithm,  which  we  term  the  EMAX  algorithm,  for  estimating  the  AR  parameters 
as  well  as  the  means,  variances,  and  weighting  coefficients  of  the  Gaussian-mixture  pdf.  In 
the  latter  part  of  the  chapter,  we  briefly  examine  the  problem  of  estimating  an  ARGMIX 
signal  that  has  been  corrupted  by  independent  additive  white  Gaussian  noise.  We  show 
that  an  optimal  solution  to  this  problem  can  be  readily  derived,  but  that  this  solution  is 
impractical  to  implement  because  it  requires  too  much  computation. 

In  Chapter  3,  we  begin  our  development  of  the  concept  that  an  arbitrary  stationary  AR 
process  can  be  usefully  represented  by  a  finite-state  HMM.  The  HMM-based  signal  model 
is  introduced  as  an  alternative  to  the  ARGMIX  model  to  reduce  the  computational  burden 
incurred  under  the  ARGMIX  assumption.  We  first  define  optimization  criteria  that  allow 
us  to  determine  how  to  best  approximate  the  true  random  signal  by  an  HMM  of  fixed  order. 
We  then  derive  analytical  formulas  for  the  optimal  HMM  parameters.  While  much  of  our 
initial  analysis  assumes  the  true  signal  is  AR  having  order  one,  we  show  that  the  results 
also  apply  to  higher-order  AR  signals. 

In  Chapter  4,  we  develop  a  practical  algorithm  for  estimating  the  parameter  values  of 
the best HMM-based representation of a stationary signal using only a finite-length observation.
This algorithm can therefore be viewed as a way of solving the source identification
problem,  at  least  in  an  approximate  sense.  The  algorithm  is  designed  to  iteratively  adjust 
the  boundaries  of  the  regions  making  up  the  state-space  partition  until  the  optimal  partition 
is  reached.  The  basic  ideas  used  to  guide  the  iterative  search  are  drawn  from  the  theoretical 
results  derived  in  Chapter  3. 

In  Chapter  5,  we  describe  how  an  HMM-based  representation  of  a  non-Gaussian  process 
can  be  used  to  solve  the  signal  estimation  problem.  We  begin  by  constructing  a  smoothing 
algorithm  for  the  simplest  case  in  which  the  signal  and  noise  are  jointly  characterized  by 
a Gaussian pdf. For this case, we show that only a few states are required in the finite-state
model for the measurement to achieve near-optimal estimation performance. We then
extend  the  basic  smoothing  algorithm  so  that  it  applies  to  the  cases  in  which  the  additive 
noise  may  be  non-Gaussian  and  even  temporally  correlated. 

In  Chapter  6,  we  briefly  summarize  the  highlights  of  our  work,  discuss  the  main  thesis 
contributions,  and  provide  suggestions  for  future  related  research  in  non-Gaussian  signal 
processing.


1.8 Remarks on Notation


We  adopt  the  usual  convention  of  writing  random  variables  in  upper  case  and  particular 
realizations of random variables in lower case. If $X$ is a random variable, then we denote
its pdf by $f_X(\cdot)$. If $X$ takes values from a set containing finitely many elements, its pdf
will contain impulses (i.e., Dirac delta functions), but in such cases this pdf will be used
only under appropriate integrals. If $Y$ is also a random variable, then the conditional pdf
of $X$ given $Y$ is written $f_{X|Y}(\cdot|\cdot)$. If these densities depend on a parameter $\theta$, then they are
written as $f_X(\cdot;\theta)$ and $f_{X|Y}(\cdot|\cdot;\theta)$, respectively. Expectations and conditional expectations
associated with densities that depend on a parameter $\theta$ are analogously denoted by $E\{\cdot;\theta\}$
and $E\{\cdot|\cdot;\theta\}$, respectively. Vector-valued variables, both random and deterministic, are
written in boldface. If $\mathbf{x}$ is an $n$-dimensional vector, then the $i$th element of $\mathbf{x}$ is denoted
by $x_i$ for $i = 1, \ldots, n$. Finally, we use the function definition

\[ \mathcal{N}(w; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{(w-\mu)^2}{2\sigma^2} \right\}, \qquad -\infty < w < \infty, \tag{1.8} \]


as  a  compact  notation  for  the  Gaussian  pdf,  since  this  density  is  used  frequently  in  the 
remaining  chapters.  A  summary  of  much  of  the  additional  notation  used  in  the  thesis  can 
be  found  in  Appendix  A. 


Chapter  2 


Using  the  ARGMIX  Signal  Model 
for Non-Gaussian Inference


2.1  Introduction 

In this chapter, we begin the technical core of the thesis by developing an inference framework
for the ARGMIX signal model, which was briefly described in the introduction. In the
following  two  subsections,  we  outline  the  basic  assumptions  and  notation  that  will  be  used 
in  conjunction  with  the  ARGMIX  model,  and  we  give  concise  formulations  of  the  source 
identification  problem  and  the  signal  estimation  problem  based  on  this  model.  In  the  third 
subsection,  we  describe  how  the  remaining  material  in  the  chapter  is  organized. 

2.1.1  Preliminary  Assumptions  and  Notation 

We consider a discrete-time scalar-valued random process $\{Y_t\}$ that satisfies the $K$th-order
autoregressive difference equation

\[ Y_t = \sum_{k=1}^{K} a_k Y_{t-k} + W_t, \tag{2.1} \]

where $\{a_k\}_{k=1}^{K}$ are the real-valued AR coefficients of the process, and $\{W_t\}$ is a sequence
(termed the driving process or driving noise) that consists of i.i.d. random variables having
a Gaussian-mixture pdf defined by

\[ f_W(w) = \sum_{i=1}^{M} \rho_i\, \mathcal{N}(w; \mu_i, \sigma_i), \tag{2.2} \]

where the weighting coefficients satisfy $\rho_i > 0$ for $i = 1, 2, \ldots, M$ and $\sum_{i=1}^{M} \rho_i = 1$.

Alternatively, we can express the $t$th sample of the driving process as

\[ W_t = \sigma(\Phi_t)\, U_t + \mu(\Phi_t), \tag{2.3} \]

where $\{U_t\}$ is a sequence of i.i.d., zero-mean, unit-variance Gaussian random variables, $\sigma$
and $\mu$ are mappings defined by $\sigma(i) = \sigma_i$ and $\mu(i) = \mu_i$ for $i = 1, 2, \ldots, M$, $\{\Phi_t\}$ is a
sequence of i.i.d. discrete-valued random variables distributed according to the probability
law $\Pr(\Phi_t = i) = \rho_i$ for $i = 1, 2, \ldots, M$, and the processes $\{U_t\}$ and $\{\Phi_t\}$ are assumed
statistically independent. The representation of the driving process given in (2.3) will be
very useful in the derivation of our parameter estimation algorithm in Section 2.2.
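
The representation in (2.3) also gives a direct recipe for simulating an ARGMIX process. The Python sketch below is only an illustration under stated assumptions: the function name `sample_argmix`, the zero initial conditions, and the burn-in length are choices made here, not part of the model definition.

```python
import numpy as np

def sample_argmix(n, a, rho, mu, sigma, rng, burn_in=500):
    """Draw n samples of a Kth-order ARGMIX process using (2.1) and (2.3).

    Zero initial conditions and a burn-in period are assumptions of this
    sketch, used so that the returned segment is approximately stationary.
    """
    a, mu, sigma = (np.asarray(v, dtype=float) for v in (a, mu, sigma))
    k, total = a.size, n + burn_in
    phi = rng.choice(len(rho), size=total, p=rho)   # pdf-selection variables
    u = rng.standard_normal(total)                  # zero-mean, unit-variance Gaussian
    w = sigma[phi] * u + mu[phi]                    # driving noise, cf. (2.3)
    y = np.zeros(total + k)                         # k leading zeros act as initial state
    for t in range(total):
        # y[t:t+k][::-1] holds (Y_{t-1}, ..., Y_{t-K}) in the shifted indexing
        y[t + k] = a @ y[t:t + k][::-1] + w[t]      # AR recursion, cf. (2.1)
    return y[k + burn_in:], w[burn_in:], phi[burn_in:]
```

Note that the class labels here are zero-based (0 to M-1), whereas the text indexes the mixture components from 1 to M.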

2.1.2  Problem  Statement  and  Approach  to  Solution 
2.1.2.1 ARGMIX Source Identification

For the source identification problem, we assume that the order of the autoregression, $K$,
and the number of constituent densities in the Gaussian-mixture pdf, $M$, are given, and
that the parameters

\[ \boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_M) \tag{2.4} \]
\[ \boldsymbol{\sigma} = (\sigma_1, \sigma_2, \ldots, \sigma_M) \tag{2.5} \]
\[ \boldsymbol{\rho} = (\rho_1, \rho_2, \ldots, \rho_M) \tag{2.6} \]
\[ \mathbf{a} = (a_1, a_2, \ldots, a_K) \tag{2.7} \]

are unknown. In addition, we assume that the random variables $Y_{-K}, Y_{-K+1}, \ldots, Y_{N-1}$
take the values $y_{-K}, y_{-K+1}, \ldots, y_{N-1}$, respectively, and we wish to estimate the parameter
vector

\[ \boldsymbol{\Psi} = (\boldsymbol{\mu}, \boldsymbol{\sigma}, \boldsymbol{\rho}, \mathbf{a}) \tag{2.8} \]

based on our observation. For notational convenience, we define the random vectors
$\mathbf{Y} = (Y_0, Y_1, \ldots, Y_{N-1})$ and $\mathbf{Y}_t = (Y_{t-1}, Y_{t-2}, \ldots, Y_{t-K})$ for $t = 0, 1, \ldots, N$, and denote the
realizations of these vectors by $\mathbf{y}$ and $\mathbf{y}_t$, respectively.

As mentioned in Chapter 1, we are not strictly seeking an ML estimate because, in most
cases, degenerate estimates exist that have infinite likelihood. To see how such degenerate
estimates can arise, one can easily verify that if we put, say, $a_i = 0$ for $i = 1, 2, \ldots, K$,
$(\mu_i, \sigma_i, \rho_i) = (0, 1, 1/M)$ for $i = 2, \ldots, M$, and $\mu_1 = y_0$, and then let $\sigma_1 \to 0$, then the likelihood
function $f_{\mathbf{Y}_0,\mathbf{Y}}(\mathbf{y}_0, \mathbf{y}; \boldsymbol{\Psi})$ will increase without bound. This assignment of parameter
values corresponds to choosing the unknown AR system to be an identity system and one
of the Gaussian densities in the mixture to be an impulse centered directly on one of the
observations.
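
The unbounded growth of the likelihood is easy to reproduce numerically. The following Python sketch is an illustration only: it takes $M = 2$, sets the AR coefficients to zero so that the residuals equal the observations, and uses a small made-up data set (`y`); the function name `gmix_loglik` is hypothetical.

```python
import numpy as np

def gmix_loglik(w, rho, mu, sigma):
    """Log-likelihood of i.i.d. samples w under the Gaussian-mixture pdf (2.2)."""
    w = np.asarray(w, dtype=float)[:, None]
    comp = np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)
    return float(np.sum(np.log(comp @ rho)))

# Hypothetical observations; with a = 0 they coincide with the residuals.
y = np.linspace(-2.0, 2.0, 21)
rho = np.array([0.5, 0.5])          # (rho_1, rho_2) = (1/M, 1/M) with M = 2
mu = np.array([y[0], 0.0])          # component 1 is centered on the observation y_0

# Shrinking sigma_1 toward zero drives the log-likelihood upward without bound.
logliks = [gmix_loglik(y, rho, mu, np.array([s1, 1.0])) for s1 in (1e-2, 1e-4, 1e-6)]
```

Each hundredfold reduction of $\sigma_1$ in this example raises the log-likelihood by roughly $\log 100$, because only the term for the sample sitting under the narrowing component changes appreciably.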

It is apparent from (2.2) that degenerate estimates can be obtained only if one or more
of the standard deviation estimates is chosen to be zero. We may be tempted to avoid
this problem by restricting all of the standard deviation estimates to be greater than some
prespecified positive threshold. However, if this minimum threshold is set too low, then
meaningless estimates can arise when the largest likelihood value occurs on the boundary
of the restricted parameter space near a singularity at which $\sigma_i = 0$ for some $i$. Yet if the
threshold is set too high, we risk excluding the best available estimate, since a component of
the true Gaussian-mixture pdf may have a standard deviation smaller than the artificially
set threshold.

One  alternative  to  maximizing  the  likelihood  function  is  to  find  the  parameters  that 
achieve  the  largest  of  the  finite  local  maxima  [50].  In  general,  no  closed-form  solution  exists 
for  this  estimate,  and  a  numerical  method  must  typically  be  used.  Because  the  likelihood 
surface  may  have  numerous  local  maxima,  there  is  no  guarantee  that  classical  optimization 
techniques  will  find  the  largest  local  maximum.  Yet  Titterington  [200]  has  found  that 
methods  based  on  finding  local  maxima  (not  necessarily  the  largest  finite  local  maximum) 
yield  useful  estimates.  Accordingly,  we  take  the  approach  of  searching  for  local  maxima  of 
the  likelihood  function  using  the  generalized  expectation-maximization  (EM)  algorithm. 

More formally, if we let $\mathcal{P}$ denote the set of all possible values for the parameter vector
$\boldsymbol{\Psi}$, then the estimate we seek for $\boldsymbol{\Psi}$ is any $\hat{\boldsymbol{\Psi}}$ satisfying

\begin{align}
\hat{\boldsymbol{\Psi}} &\in \arg\max_{\boldsymbol{\Psi}' \in \mathcal{P}} \left\{ \log f_{\mathbf{Y}_0,\mathbf{Y}}(\mathbf{y}_0, \mathbf{y}; \boldsymbol{\Psi}') \right\} \tag{2.9} \\
&= \arg\max_{\boldsymbol{\Psi}' \in \mathcal{P}} \left\{ \log f_{\mathbf{Y}_0}(\mathbf{y}_0; \boldsymbol{\Psi}') + \log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}') \right\}, \tag{2.10}
\end{align}

where the notation $\arg\max_{x \in \mathcal{P}}\{g(x)\}$ stands for the set of all parameter values in $\mathcal{P}$ achieving
finite local maxima of $g$.

Since the estimate $\hat{\boldsymbol{\Psi}}$ is defined in terms of the likelihood function, but is not obtained
through a standard global maximization, we refer to this estimate as a quasi-maximum likelihood
(QML) estimate. In the sequel, we shall assume that $N \gg K$, i.e., that the number of
samples in the observed sequence is much greater than the number of AR parameters to be
estimated. Under this assumption, we may, as is standard in the derivation of ML estimates
for Gaussian AR processes, ignore the first term of the log-likelihood function appearing on
the right-hand side of (2.10) and assume that a QML estimate is any $\hat{\boldsymbol{\Psi}}$ satisfying

\[ \hat{\boldsymbol{\Psi}} \in \arg\max_{\boldsymbol{\Psi}' \in \mathcal{P}} \left\{ \log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}') \right\}. \tag{2.11} \]

2.1.2.2 ARGMIX Signal Estimation

In the signal estimation problem, we are not given a clean observation of $\{Y_t\}$, as we were
in the previous case. Instead, we observe the signal only after it has been corrupted by
additive white Gaussian noise; hence, each element of the observed sequence $\{Z_t\}$ has the
form

\[ Z_t = Y_t + V_t, \tag{2.12} \]


where $\{V_t\}$ is a sequence of i.i.d. random variables, each having a pdf $f_V(\cdot)$ defined by

\[ f_V(v) = \mathcal{N}(v; 0, \sigma_V). \tag{2.13} \]

All random variables in the observation-noise sequence $\{V_t\}$ and in the driving-noise sequence
$\{W_t\}$ are understood to be mutually independent.

For this problem, we assume that we know the true value of the signal parameter vector
$\boldsymbol{\Psi}$ as well as that of the noise standard deviation $\sigma_V$. In addition, we assume that we are
given realizations of the first $N$ samples of the sequence $\{Z_t\}$. Given that we have observed
the event $\mathbf{Z}_{0:N-1} = \mathbf{z}_{0:N-1}$, our objective is to produce an MMSE estimate of the underlying
signal realization $\mathbf{y}_{0:N-1}$. It is straightforward to show that the desired estimate $\hat{\mathbf{y}}_{0:N-1}$ is
the conditional mean vector given by

\[ \hat{\mathbf{y}}_{0:N-1} = E\left\{ \mathbf{Y}_{0:N-1} \mid \mathbf{Z}_{0:N-1} = \mathbf{z}_{0:N-1}; \boldsymbol{\Psi} \right\}. \tag{2.14} \]

As  we  will  discover  later  in  the  chapter,  the  specific  mathematical  form  of  this  estimate 
is  fairly  straightforward  to  derive,  but  the  estimate  itself  is  often  impractical  to  compute. 
Thus,  we  suggest  several  possible  suboptimal  estimators  in  this  case. 

2.1.3  Chapter  Organization 

The  chapter  is  organized  in  the  following  way.  We  begin  by  providing  a  brief  overview  of 
the  EM  and  generalized  EM  principles.  We  then  use  this  EM  theory  to  derive  an  iterative 
method,  referred  to  as  the  EMAX  algorithm,  which  jointly  estimates  the  AR  parameters  and 
mixture  parameters  of  an  ARGMIX  process  via  the  QML  approach  described  above.  Next, 
we  present  and  discuss  four  separate  applications  of  the  EMAX  algorithm  and  compare, 
through  computer  simulations,  the  performance  of  our  algorithm  to  that  of  the  standard 
least-squares  technique  as  well  as  to  that  of  previously  developed  algorithms  based  on  a 
similar  signal  model.  In  the  latter  part  of  the  chapter,  we  also  derive  a  useful  variant  of 
the  EMAX  algorithm;  this  alternative  technique  is  designed  to  estimate  the  AR  parameters 
and  the  overall  gain  associated  with  the  ARGMIX  process  based  on  the  assumption  that 
the functional form of the driving-noise pdf is precisely known. We then derive and analyze
a theoretical solution to the ARGMIX signal estimation problem. Finally, we discuss
the  advantages,  limitations,  and  possible  extensions  of  the  various  estimation  techniques 
developed  in  the  chapter. 


2.2  ARGMIX  Source  Identification  Using  the  EM  Principle 

2.2.1  Theory  of  the  EM  and  GEM  Algorithms 

The EM and GEM algorithms, which were first proposed by Dempster et al. [48], are iterative
techniques for finding local maxima of likelihood functions. Although their convergence
rates  are  slow,  these  algorithms  converge  reliably  to  local  maxima  of  the  likelihood  function 
under  appropriate  conditions,  require  no  derivatives  of  the  likelihood  function,  and  often 
yield  equations  that  have  an  intuitively  pleasing  interpretation. 

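The EM and GEM algorithms are best suited to problems in which there is a “complete”
data specification $\mathbf{Z}$, from which the original observations $\mathbf{Y}$ can be derived, and such that
the expectation $E\{\log f_{\mathbf{Z}}(\mathbf{Z}; \boldsymbol{\Psi}') \mid \mathbf{Y} = \mathbf{y}; \boldsymbol{\Psi}''\}$ can be easily computed for any two parameter
vectors $\boldsymbol{\Psi}', \boldsymbol{\Psi}'' \in \mathcal{P}$. For our problem, we use the complete data specification $\mathbf{Z} = (\mathbf{Y}, \boldsymbol{\Phi})$,
where $\boldsymbol{\Phi}$ is the vector of pdf-selection variables defined by $\boldsymbol{\Phi} = (\Phi_0, \Phi_1, \ldots, \Phi_{N-1})$. With
this choice of complete data, the EM algorithm as applied to our problem generates a
sequence of estimates according to the recursive formula

\[ \boldsymbol{\Psi}^{(s+1)} = \arg\max_{\boldsymbol{\Psi}' \in \mathcal{P}} E\left\{ \log f_{\mathbf{Y},\boldsymbol{\Phi}|\mathbf{Y}_0}(\mathbf{Y}, \boldsymbol{\Phi}|\mathbf{y}_0; \boldsymbol{\Psi}') \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\}, \tag{2.15} \]

where some starting estimate $\boldsymbol{\Psi}^{(0)}$ must be chosen to initialize the recursion. We now show
that the sequence of estimates $\{\boldsymbol{\Psi}^{(s)}\}_{s=0}^{\infty}$ defined above satisfies the inequality

\[ \log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}^{(s+1)}) \geq \log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}^{(s)}) \tag{2.16} \]

for $s = 0, 1, 2, \ldots$; that is, we show that the log-likelihood value associated with our updated
parameter estimate is increased at each iteration. We begin by writing the log-likelihood
function for the observed data with parameters $\boldsymbol{\Psi}' \in \mathcal{P}$ as

\[ \log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}') = \log f_{\mathbf{Y},\boldsymbol{\Phi}|\mathbf{Y}_0}(\mathbf{y}, \boldsymbol{\phi}|\mathbf{y}_0; \boldsymbol{\Psi}') - \log f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\phi}|\mathbf{y}, \mathbf{y}_0; \boldsymbol{\Psi}'). \tag{2.17} \]

Integrating both sides of (2.17) with respect to $\boldsymbol{\phi}$ against the density $f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\phi}|\mathbf{y}, \mathbf{y}_0; \boldsymbol{\Psi}^{(s)})$
gives

\begin{align}
\log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}') &= E\left\{ \log f_{\mathbf{Y},\boldsymbol{\Phi}|\mathbf{Y}_0}(\mathbf{y}, \boldsymbol{\Phi}|\mathbf{y}_0; \boldsymbol{\Psi}') \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} \nonumber \\
&\quad - E\left\{ \log f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\Phi}|\mathbf{y}, \mathbf{y}_0; \boldsymbol{\Psi}') \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} \tag{2.18} \\
&= U(\boldsymbol{\Psi}', \boldsymbol{\Psi}^{(s)}) - V(\boldsymbol{\Psi}', \boldsymbol{\Psi}^{(s)}), \tag{2.19}
\end{align}

where the functions $U$ and $V$ are defined in the obvious way. Then (2.15) can be written as

\[ \boldsymbol{\Psi}^{(s+1)} = \arg\max_{\boldsymbol{\Psi}' \in \mathcal{P}} U(\boldsymbol{\Psi}', \boldsymbol{\Psi}^{(s)}). \tag{2.20} \]

The definition of $V$ together with Jensen’s inequality allows us to conclude that
$V(\boldsymbol{\Psi}', \boldsymbol{\Psi}^{(s)}) \leq V(\boldsymbol{\Psi}^{(s)}, \boldsymbol{\Psi}^{(s)})$ for any $\boldsymbol{\Psi}' \in \mathcal{P}$. Hence, we have

\begin{align}
\log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}')\big|_{\boldsymbol{\Psi}' = \boldsymbol{\Psi}^{(s+1)}} &= U(\boldsymbol{\Psi}^{(s+1)}, \boldsymbol{\Psi}^{(s)}) - V(\boldsymbol{\Psi}^{(s+1)}, \boldsymbol{\Psi}^{(s)}) \tag{2.21} \\
&\geq U(\boldsymbol{\Psi}^{(s+1)}, \boldsymbol{\Psi}^{(s)}) - V(\boldsymbol{\Psi}^{(s)}, \boldsymbol{\Psi}^{(s)}) \tag{2.22} \\
&\geq U(\boldsymbol{\Psi}^{(s)}, \boldsymbol{\Psi}^{(s)}) - V(\boldsymbol{\Psi}^{(s)}, \boldsymbol{\Psi}^{(s)}) \tag{2.23} \\
&= \log f_{\mathbf{Y}|\mathbf{Y}_0}(\mathbf{y}|\mathbf{y}_0; \boldsymbol{\Psi}')\big|_{\boldsymbol{\Psi}' = \boldsymbol{\Psi}^{(s)}}, \tag{2.24}
\end{align}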

which  implies  that  the  EM  algorithm  gives  a  sequence  of  parameter  estimates  with  increasing 
likelihoods.  If  the  function  U  is  continuous  in  both  of  its  arguments,  the  sequence  of 
estimates  converges  to  a  stationary  point  of  the  log-likelihood  function  [231]. 
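
For completeness, the Jensen's-inequality step invoked in the argument above can be written out as follows. This is the standard argument, restated here with the function $V$ taken to be the second conditional expectation in the decomposition of the log-likelihood and with the sum running over the finitely many values of the pdf-selection vector; it is a sketch, not a reproduction of the original derivation.

```latex
\begin{align*}
V(\boldsymbol{\Psi}', \boldsymbol{\Psi}^{(s)}) - V(\boldsymbol{\Psi}^{(s)}, \boldsymbol{\Psi}^{(s)})
 &= E\left\{ \log
    \frac{f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\Phi}|\mathbf{y},\mathbf{y}_0;\boldsymbol{\Psi}')}
         {f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\Phi}|\mathbf{y},\mathbf{y}_0;\boldsymbol{\Psi}^{(s)})}
    \,\middle|\, \mathbf{Y}=\mathbf{y},\ \mathbf{Y}_0=\mathbf{y}_0;\ \boldsymbol{\Psi}^{(s)} \right\} \\
 &\leq \log E\left\{
    \frac{f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\Phi}|\mathbf{y},\mathbf{y}_0;\boldsymbol{\Psi}')}
         {f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\Phi}|\mathbf{y},\mathbf{y}_0;\boldsymbol{\Psi}^{(s)})}
    \,\middle|\, \mathbf{Y}=\mathbf{y},\ \mathbf{Y}_0=\mathbf{y}_0;\ \boldsymbol{\Psi}^{(s)} \right\} \\
 &= \log \sum_{\boldsymbol{\phi}\,:\,f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\phi}|\mathbf{y},\mathbf{y}_0;\boldsymbol{\Psi}^{(s)})>0}
    f_{\boldsymbol{\Phi}|\mathbf{Y},\mathbf{Y}_0}(\boldsymbol{\phi}|\mathbf{y},\mathbf{y}_0;\boldsymbol{\Psi}')
 \;\leq\; \log 1 = 0 .
\end{align*}
```

The first inequality is Jensen's inequality applied to the concave logarithm; the last holds because a conditional probability mass function sums to at most one over any subset of its support.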


The GEM algorithm is an alternative form of the EM algorithm that is often easier to
implement. Such an algorithm chooses $\boldsymbol{\Psi}^{(s+1)}$ such that

\[ U(\boldsymbol{\Psi}^{(s+1)}, \boldsymbol{\Psi}^{(s)}) \geq U(\boldsymbol{\Psi}^{(s)}, \boldsymbol{\Psi}^{(s)}) \tag{2.25} \]

at each iteration $s$. It does not necessarily select $\boldsymbol{\Psi}^{(s+1)}$ such that (2.20) is satisfied. Using
the same reasoning we used to go from (2.21) to (2.24), we see that a GEM algorithm also
produces a sequence of parameter estimates with increasing likelihoods. Whether the limit
of this sequence of estimates is a stationary point of the likelihood function depends on
the particular rule for selecting $\boldsymbol{\Psi}^{(s+1)}$ from $\boldsymbol{\Psi}^{(s)}$. If $\boldsymbol{\Psi}^{(s+1)}$ is selected so that it is a local
maximum of $U(\boldsymbol{\Psi}', \boldsymbol{\Psi}^{(s)})$ over $\boldsymbol{\Psi}' \in \mathcal{P}$, then the sequence converges to a stationary point of
the likelihood function [112, 231]. We will use this local-maximum rule for selecting updated
parameters in our GEM algorithm. As is the case with all “hill-climbing” algorithms, the
limit of the sequence of estimates generated by an EM or GEM algorithm may not be a
global maximum of the likelihood function. Therefore, choosing $\boldsymbol{\Psi}^{(0)}$ judiciously is the key
to obtaining a good parameter estimate. A simple method for choosing $\boldsymbol{\Psi}^{(0)}$ is given and
empirically shown to be adequate in Section 2.3.


2.2.2  Derivation  of  the  EMAX  Algorithm 

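We now derive the EMAX algorithm by using a GEM method that chooses $\boldsymbol{\Psi}^{(s+1)}$ to be a
local maximum of $U(\boldsymbol{\Psi}', \boldsymbol{\Psi}^{(s)})$ over $\boldsymbol{\Psi}' \in \mathcal{P}$. We let $\boldsymbol{\Psi}' = (\boldsymbol{\mu}', \boldsymbol{\sigma}', \boldsymbol{\rho}', \mathbf{a}')$ and write (2.20) as

\begin{align}
\boldsymbol{\Psi}^{(s+1)} = \arg\max_{\mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}', \boldsymbol{\rho}'} E\big\{ & \log f_{\boldsymbol{\Phi}|\mathbf{Y}_0}(\boldsymbol{\Phi}|\mathbf{y}_0; \boldsymbol{\rho}') \nonumber \\
& + \log f_{\mathbf{Y}|\boldsymbol{\Phi},\mathbf{Y}_0}(\mathbf{y}|\boldsymbol{\Phi}, \mathbf{y}_0; \mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}') \,\big|\, \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \big\}. \tag{2.26}
\end{align}

This is equivalent to solving the following two maximization problems:

\begin{align}
\boldsymbol{\rho}^{(s+1)} &= \arg\max_{\boldsymbol{\rho}'} E\left\{ \log f_{\boldsymbol{\Phi}|\mathbf{Y}_0}(\boldsymbol{\Phi}|\mathbf{y}_0; \boldsymbol{\rho}') \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} \tag{2.27} \\
(\mathbf{a}^{(s+1)}, \boldsymbol{\mu}^{(s+1)}, \boldsymbol{\sigma}^{(s+1)}) &= \arg\max_{\mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}'} E\left\{ \log f_{\mathbf{Y}|\boldsymbol{\Phi},\mathbf{Y}_0}(\mathbf{y}|\boldsymbol{\Phi}, \mathbf{y}_0; \mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}') \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} \tag{2.28}
\end{align}

To find $\boldsymbol{\rho}^{(s+1)}$ so that (2.27) is satisfied, we first define the functions $\{d_j\}_{j=1}^{M}$ and $\{C_j\}_{j=1}^{M}$
by

\begin{align}
d_j(\phi) &= \begin{cases} 1 & \text{if } \phi = j, \\ 0 & \text{otherwise;} \end{cases} \tag{2.29} \\
C_j(\phi_0, \phi_1, \ldots, \phi_{N-1}) &= \sum_{t=0}^{N-1} d_j(\phi_t); \tag{2.30}
\end{align}

that is, $C_j(\boldsymbol{\Phi})$ is the number of times the symbol $j$ appears in the vector $\boldsymbol{\Phi}$. In addition,
for notational convenience, we define the function $P_{t,j}$ by

\[ P_{t,j}(\boldsymbol{\Psi}') = \Pr\{\Phi_t = j \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}'\} \tag{2.31} \]

for all $\boldsymbol{\Psi}' \in \mathcal{P}$, for $t = 0, 1, \ldots, N-1$ and $j = 1, 2, \ldots, M$. Using these definitions, the
maximization in (2.27), which is over all $\boldsymbol{\rho}'$ such that $\rho'_j > 0$ and $\sum_{j=1}^{M} \rho'_j = 1$, can be
written

\begin{align}
\boldsymbol{\rho}^{(s+1)} &= \arg\max_{\boldsymbol{\rho}'} E\left\{ \log \prod_{j=1}^{M} (\rho'_j)^{C_j(\boldsymbol{\Phi})} \,\middle|\, \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} \tag{2.32} \\
&= \arg\max_{\boldsymbol{\rho}'} E\left\{ \sum_{j=1}^{M} C_j(\boldsymbol{\Phi}) \log \rho'_j \,\middle|\, \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} \tag{2.33} \\
&= \arg\max_{\boldsymbol{\rho}'} \sum_{j=1}^{M} \sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)}) \log \rho'_j \tag{2.34} \\
&= \left( \frac{1}{N} \sum_{t=0}^{N-1} P_{t,1}(\boldsymbol{\Psi}^{(s)}), \; \ldots, \; \frac{1}{N} \sum_{t=0}^{N-1} P_{t,M}(\boldsymbol{\Psi}^{(s)}) \right), \tag{2.35}
\end{align}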
where  the  last  equality  follows  from  Jensen’s  inequality. 
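
A minimal Python sketch of this step is given below. It assumes, as the later discussion of the algorithm indicates, that the posterior probability for sample $t$ depends on the data only through the residual of that sample and follows from Bayes' rule; the function names `posterior_probs` and `update_weights`, the array layout, and the log-domain normalization are choices made for this illustration.

```python
import numpy as np

def posterior_probs(y, y_past, a, rho, mu, sigma):
    """Return P with P[t, j] = Pr{Phi_t = j | data} under the current estimate.

    y      : (N,) array of samples y_0, ..., y_{N-1}
    y_past : (N, K) array whose row t is (y_{t-1}, ..., y_{t-K})
    Columns are zero-based: column j corresponds to mixture component j + 1.
    """
    rho, mu, sigma = (np.asarray(v, dtype=float) for v in (rho, mu, sigma))
    resid = np.asarray(y, dtype=float) - np.asarray(y_past, dtype=float) @ np.asarray(a, dtype=float)
    z = (resid[:, None] - mu) / sigma
    # log of rho_j * N(resid_t; mu_j, sigma_j), dropping the common sqrt(2*pi) factor
    log_joint = np.log(rho) - np.log(sigma) - 0.5 * z ** 2
    log_joint -= log_joint.max(axis=1, keepdims=True)    # guard against underflow
    joint = np.exp(log_joint)
    return joint / joint.sum(axis=1, keepdims=True)      # Bayes' rule

def update_weights(P):
    """Weight update: time average of the posterior probabilities, cf. (2.35)."""
    return P.mean(axis=0)
```

Because every row of `P` sums to one, the updated weights returned by `update_weights` automatically sum to one as well.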

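To attempt the maximization in (2.28), we use the knowledge that the driving process
is a sequence of i.i.d. Gaussian-mixture random variables to write the pdf for $\mathbf{Y}$ conditioned
on $\boldsymbol{\Phi}$ and $\mathbf{Y}_0$ as

\begin{align}
f_{\mathbf{Y}|\boldsymbol{\Phi},\mathbf{Y}_0}(\mathbf{y}|\boldsymbol{\phi}, \mathbf{y}_0; \mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}') &= \prod_{t=0}^{N-1} \mathcal{N}(y_t - \mathbf{y}_t^T \mathbf{a}'; \mu'_{\phi_t}, \sigma'_{\phi_t}) \tag{2.36} \\
&= \prod_{t=0}^{N-1} \frac{1}{\sqrt{2\pi}\,\sigma'_{\phi_t}} \exp\left\{ -\frac{(y_t - \mathbf{y}_t^T \mathbf{a}' - \mu'_{\phi_t})^2}{2\sigma'^{2}_{\phi_t}} \right\}. \tag{2.37}
\end{align}

Notice that the term $y_t - \mathbf{y}_t^T \mathbf{a}'$ represents the residual or prediction error obtained by using
$\mathbf{a}'$ as the AR parameter vector. The function being maximized in (2.28) can then be written
as

\begin{align}
& E\left\{ \log f_{\mathbf{Y}|\boldsymbol{\Phi},\mathbf{Y}_0}(\mathbf{y}|\boldsymbol{\Phi}, \mathbf{y}_0; \mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}') \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} = \nonumber \\
& \quad -\frac{N}{2} \log 2\pi - \sum_{t=0}^{N-1} \sum_{j=1}^{M} P_{t,j}(\boldsymbol{\Psi}^{(s)}) \log \sigma'_j - \sum_{t=0}^{N-1} \sum_{j=1}^{M} P_{t,j}(\boldsymbol{\Psi}^{(s)}) \frac{(y_t - \mathbf{y}_t^T \mathbf{a}' - \mu'_j)^2}{2\sigma'^{2}_j}. \tag{2.38}
\end{align}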

Taking derivatives of this expression with respect to the quantities $\mu'_j$, $\sigma'_j$, and $\mathbf{a}'$ and
setting the resulting expressions equal to zero yields three coupled nonlinear equations that
define a stationary point of the right-hand side of (2.38). Because we are unable to solve
these nonlinear equations analytically, it is difficult to find a global maximum. We instead
use the method of coordinate ascent to numerically find a local maximum, resulting in a
GEM algorithm rather than an EM algorithm. Coordinate ascent increases a multivariate
function at each iteration by changing one variable at a time. If, at each iteration, the
variable that is allowed to change is chosen to achieve the maximum of the function while
the other variables are kept fixed, then coordinate ascent converges to a local maximum of
the function [112]. Coordinate ascent is attractive because it is simple to maximize (2.38)
separately over each variable as follows:

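\begin{align}
& \arg\max_{\mu'_j} E\left\{ \log f_{\mathbf{Y}|\boldsymbol{\Phi},\mathbf{Y}_0}(\mathbf{y}|\boldsymbol{\Phi}, \mathbf{y}_0; (\mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}')) \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\}
 = \frac{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)}) (y_t - \mathbf{y}_t^T \mathbf{a}')}{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)})} \tag{2.39} \\
& \arg\max_{\sigma'_j} E\left\{ \log f_{\mathbf{Y}|\boldsymbol{\Phi},\mathbf{Y}_0}(\mathbf{y}|\boldsymbol{\Phi}, \mathbf{y}_0; (\mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}')) \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\}
 = \sqrt{\frac{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)}) (y_t - \mathbf{y}_t^T \mathbf{a}' - \mu'_j)^2}{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)})}} \tag{2.40} \\
& \arg\max_{\mathbf{a}'} E\left\{ \log f_{\mathbf{Y}|\boldsymbol{\Phi},\mathbf{Y}_0}(\mathbf{y}|\boldsymbol{\Phi}, \mathbf{y}_0; (\mathbf{a}', \boldsymbol{\mu}', \boldsymbol{\sigma}')) \mid \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \boldsymbol{\Psi}^{(s)} \right\} \nonumber \\
& \qquad = \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\boldsymbol{\Psi}^{(s)})}{\sigma'^{2}_j} \mathbf{y}_t \mathbf{y}_t^T \right]^{-1} \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\boldsymbol{\Psi}^{(s)})}{\sigma'^{2}_j} (y_t - \mu'_j)\, \mathbf{y}_t \right] \tag{2.41}
\end{align}

Using the equations above, the coordinate-ascent algorithm is described as follows:

INITIALIZATION:

\begin{align}
\tilde{\mu}_j^{(0)} &= \mu_j^{(s)}, \qquad j = 1, \ldots, M \tag{2.42} \\
\tilde{\sigma}_j^{(0)} &= \sigma_j^{(s)}, \qquad j = 1, \ldots, M \tag{2.43} \\
\tilde{\mathbf{a}}^{(0)} &= \mathbf{a}^{(s)} \tag{2.44}
\end{align}

ITERATION:

\begin{align}
\tilde{\mu}_j^{(i+1)} &= \frac{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)}) (y_t - \mathbf{y}_t^T \tilde{\mathbf{a}}^{(i)})}{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)})}, \qquad j = 1, \ldots, M \tag{2.45} \\
\tilde{\sigma}_j^{(i+1)} &= \sqrt{\frac{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)}) (y_t - \mathbf{y}_t^T \tilde{\mathbf{a}}^{(i)} - \tilde{\mu}_j^{(i+1)})^2}{\sum_{t=0}^{N-1} P_{t,j}(\boldsymbol{\Psi}^{(s)})}}, \qquad j = 1, \ldots, M \tag{2.46} \\
\tilde{\mathbf{a}}^{(i+1)} &= \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\boldsymbol{\Psi}^{(s)})}{(\tilde{\sigma}_j^{(i+1)})^2} \mathbf{y}_t \mathbf{y}_t^T \right]^{-1} \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\boldsymbol{\Psi}^{(s)})}{(\tilde{\sigma}_j^{(i+1)})^2} (y_t - \tilde{\mu}_j^{(i+1)})\, \mathbf{y}_t \right] \tag{2.47}
\end{align}

If this recursion is iterated for $i = 0, \ldots, J-1$, then we define our parameter updates by
$\mathbf{a}^{(s+1)} = \tilde{\mathbf{a}}^{(J)}$, $\boldsymbol{\mu}^{(s+1)} = \tilde{\boldsymbol{\mu}}^{(J)}$, $\boldsymbol{\sigma}^{(s+1)} = \tilde{\boldsymbol{\sigma}}^{(J)}$. For sufficiently large values of $J$, the updated
parameters are, for practical purposes, local maxima of (2.38). Since the EMAX algorithm
is a GEM algorithm that chooses the updated parameter estimates to be local maxima
of (2.38), it converges to a stationary point. In summary, then, a single iteration of the
EMAX algorithm consists of computing $\{P_{t,j}(\boldsymbol{\Psi}^{(s)})\}$, applying (2.35), and iterating (2.45)-
(2.47) until convergence.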
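
The inner coordinate-ascent recursion can be sketched in Python as follows. This is an illustrative reading of the mean, standard-deviation, and AR-coefficient updates for fixed posterior probabilities, not the original implementation; the function name `coordinate_ascent`, the array layout, and the fixed iteration count `n_inner` (used in place of a convergence test) are assumptions.

```python
import numpy as np

def coordinate_ascent(y, y_past, P, a, n_inner=20):
    """Iterate the mean, standard-deviation, and AR updates for fixed posteriors.

    y      : (N,) array of samples y_0, ..., y_{N-1}
    y_past : (N, K) array whose row t is (y_{t-1}, ..., y_{t-K})
    P      : (N, M) array of posterior probabilities P[t, j]
    a      : (K,) current AR estimate, used to start the recursion
    """
    a = np.asarray(a, dtype=float).copy()
    col_mass = P.sum(axis=0)                                   # sum over t of P[t, j]
    for _ in range(n_inner):
        resid = y - y_past @ a                                 # y_t - y_t^T a
        mu = (P * resid[:, None]).sum(axis=0) / col_mass       # weighted mean, cf. (2.45)
        dev = resid[:, None] - mu
        sigma = np.sqrt((P * dev ** 2).sum(axis=0) / col_mass) # weighted std, cf. (2.46)
        wts = P / sigma ** 2                                   # P[t, j] / sigma_j^2
        lhs = y_past.T @ (wts.sum(axis=1)[:, None] * y_past)   # K x K weighted Gram matrix
        rhs = y_past.T @ (wts * (y[:, None] - mu)).sum(axis=1) # weighted cross term
        a = np.linalg.solve(lhs, rhs)                          # weighted least squares, cf. (2.47)
    return a, mu, sigma
```

Solving the linear system with `np.linalg.solve` rather than forming the matrix inverse explicitly is a numerical convenience; the two are algebraically equivalent when the weighted Gram matrix is nonsingular.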

As shown in Figure 2-1, the EMAX algorithm can be conceptually decomposed into
three main steps, which are iterated to produce the final parameter estimates. Observe
that the filter $1 - \sum_{k=1}^{K} a_k^{(s)} z^{-k}$ can be interpreted as the current estimate of the
inverse of the AR filter. In the first block of Figure 2-1, this inverse filter is applied to the
observations to produce the residual sequence $w_t^{(s)} = y_t - \mathbf{y}_t^T\mathbf{a}^{(s)}$, which can be
interpreted as an estimate of the driving noise. This residual sequence is used to compute the
posterior probabilities $\{P_{tj}(\Psi^{(s)})\}$. Under the hypothesis that $\mathbf{a}^{(s)}$ is the true AR
parameter vector, these residuals are statistically independent. Using the representation for
the driving process given in (2.3), we may take the view that each sample of the residual
sequence is a particular realization arising from one of $M$ randomly chosen classes, where the
pdf characterizing the jth of these classes is $\mathcal{N}(\cdot\,;\mu_j^{(s)},\sigma_j^{(s)})$. For the tth sample
of the driving noise sequence, the value of the class label j is determined by the pdf-selection
variable $\Phi_t$. Assuming that the mixture parameters are $\boldsymbol{\rho}^{(s)}$, $\boldsymbol{\mu}^{(s)}$, and
$\boldsymbol{\sigma}^{(s)}$, we can easily compute the posterior probability that the tth sample is a
realization from class j using Bayes' rule; this is the operation being performed in the second
block of Figure 2-1. With these posterior probabilities, we first compute the updated estimate
of the weighting coefficient vector according to (2.35). We then compute $\boldsymbol{\mu}^{(s+1)}$,
$\boldsymbol{\sigma}^{(s+1)}$, and $\mathbf{a}^{(s+1)}$ by iterating (2.45)-(2.47) until convergence to some
prespecified numerical tolerance is obtained; this operation is represented by the third block.
As shown in Figure 2-1, the process is repeated, starting again from the first block, until
convergence.
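
As a rough illustration of the second block of Figure 2-1, the following Python sketch computes the posterior class probabilities from a residual sequence using Bayes' rule. The function names are hypothetical, and the weight update shown (the time average of the posteriors for each class) is the standard EM update for mixture weights; we assume, without having restated it here, that this is what (2.35) specifies.

```python
import math

def gaussian_pdf(w, mean, std):
    """Gaussian density with the given mean and standard deviation."""
    z = (w - mean) / std
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std)

def posterior_probabilities(residuals, weights, means, stds):
    """Return P[t][j], the posterior probability that residual t belongs to class j."""
    posteriors = []
    for w in residuals:
        # Bayes' rule: prior weight times class likelihood, then normalize.
        joint = [rho * gaussian_pdf(w, mu, sd)
                 for rho, mu, sd in zip(weights, means, stds)]
        total = sum(joint)
        posteriors.append([p / total for p in joint])
    return posteriors

def update_weights(posteriors):
    """Assumed weight update: time average of the posteriors for each class."""
    n = len(posteriors)
    return [sum(row[j] for row in posteriors) / n
            for j in range(len(posteriors[0]))]

# Illustrative two-class mixture: a narrow class and a broad class, both zero mean.
P = posterior_probabilities([0.1, -0.3, 8.0], [0.9, 0.1], [0.0, 0.0], [1.0, 10.0])
new_weights = update_weights(P)
```

In this illustration the residual 8.0 is assigned almost entirely to the broad class, while the two small residuals are assigned mostly to the narrow class. A practical implementation would also guard against numerical underflow when a residual lies far in the tails of every class.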

A single iteration of (2.45)-(2.47) has the following intuitive interpretation. The new
estimate for the mean of the jth class is a weighted time average of the residuals, where
the weight on the tth residual sample is proportional to the posterior probability that the
sample belongs to class j. The new estimate for the variance of the jth class is a weighted
time average of the square of residuals with the previously computed estimate of the mean
of the jth class removed; once again, the weight on the tth residual sample is proportional
to the posterior probability that the sample belongs to class j. The new estimate for the
AR coefficient vector is updated via a generalized version of the Yule-Walker equations [232]
using the most recent estimates of the means and variances.
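
The interpretation above can be sketched in Python as a single pass of the updates (2.45)-(2.47). This is a minimal sketch, not the Matlab implementation referred to in this thesis: the names `coordinate_ascent_pass`, `solve_linear`, and `past` are hypothetical, `past[t]` is assumed to hold the K samples preceding `y[t]`, and the linear system is solved by plain Gaussian elimination to keep the example self-contained.

```python
import math

def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            f = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= f * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def coordinate_ascent_pass(y, past, P, a):
    """One pass of the mean, standard-deviation, and AR-coefficient updates."""
    N, M, K = len(y), len(P[0]), len(a)
    resid = [y[t] - sum(ak * pk for ak, pk in zip(a, past[t])) for t in range(N)]
    mass = [sum(P[t][j] for t in range(N)) for j in range(M)]
    # Mean update: posterior-weighted time average of the residuals.
    mu = [sum(P[t][j] * resid[t] for t in range(N)) / mass[j] for j in range(M)]
    # Standard-deviation update: weighted RMS deviation about the new mean.
    sd = [math.sqrt(sum(P[t][j] * (resid[t] - mu[j]) ** 2 for t in range(N)) / mass[j])
          for j in range(M)]
    # AR update: weighted normal equations using the new means and variances.
    R = [[0.0] * K for _ in range(K)]
    r = [0.0] * K
    for t in range(N):
        for j in range(M):
            g = P[t][j] / sd[j] ** 2
            for p in range(K):
                r[p] += g * (y[t] - mu[j]) * past[t][p]
                for q in range(K):
                    R[p][q] += g * past[t][p] * past[t][q]
    return mu, sd, solve_linear(R, r)
```

Calling `coordinate_ascent_pass` repeatedly, feeding the returned AR vector back in as `a`, corresponds to iterating the recursion. With a single class (M = 1) and all posteriors equal to one, the AR update reduces to ordinary least squares on the mean-removed data.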

We make the final observation that if the values of the parameters in any subset of
the 3M mixture parameters are known, then the update equations for the parameter
estimates can easily be modified, and the properties of the EMAX algorithm will be preserved.
Specifically, we simply replace the parameter updates in (2.35), (2.45)-(2.47) with the
corresponding known parameter values. Clearly, updates for the known parameters would not
be performed in this case.
Figure 2-1: Block diagram representation of the EMAX algorithm

2.3 Numerical Examples

In this section, we present several examples to illustrate the behavior and performance of
the EMAX algorithm. An implementation of the EMAX algorithm using the Matlab
programming language is given in the appendix, and was used in each of the examples
below.  The  examples  were  selected  with  several  objectives  in  mind:  (i)  to  verify  that  the 
EMAX  algorithm  behaves  as  expected  and  produces  results  consistent  with  those  obtained 
by  others  on  relevant  ML  estimation  problems;  (ii)  to  illustrate  that  the  EMAX  algorithm 
performs  significantly  better  in  certain  estimation  problems  than  either  conventional  least- 
squares  techniques  or  previously  proposed  algorithms  based  on  a  similar  data  model;  (iii)  to 
demonstrate  that  the  EMAX  algorithm  can  be  used  to  obtain  good  approximations  to  ML 
estimates  in  cases  where  the  functional  form  for  the  pdf  of  the  driving  process  is  unknown; 
and  (iv)  to  show  that  the  EMAX  algorithm  can  be  very  useful  in  common  signal  processing 
problems  where  the  primary  objective  is  to  recover  a  signal  from  corrupted  measurements. 

For each of the examples presented here, we found that the following simple method for
generating an initial parameter estimate $\Psi^{(0)}$ (comprising $\mathbf{a}^{(0)}$, $\boldsymbol{\rho}^{(0)}$,
$\boldsymbol{\mu}^{(0)}$, and $\boldsymbol{\sigma}^{(0)}$) for the EMAX algorithm yielded good average performance.
The vector $\mathbf{a}^{(0)}$ was computed using the forward-backward least-squares technique from
traditional AR signal analysis [92, 115]. Each of the $M$ elements of the mean vector
$\boldsymbol{\mu}^{(0)}$ was randomly generated according to a uniform pdf having region of support
$[\min_t\{w_t^{(0)}\}, \max_t\{w_t^{(0)}\}]$, where $w_t^{(0)}$ is the tth element of the residual
sequence $\mathbf{w}^{(0)}$ produced by applying the filter $1 - \sum_{k=1}^{K} a_k^{(0)} z^{-k}$ to the
sequence of observations. Each element of $\boldsymbol{\sigma}^{(0)}$ was randomly chosen according to a
uniform pdf with region of support $[0, \max_t\{w_t^{(0)}\} - \min_t\{w_t^{(0)}\}]$. Finally, the
elements of the weighting coefficient vector $\boldsymbol{\rho}^{(0)}$ were all set equal to $1/M$. For special
cases in which certain elements of $\Psi$ were assumed known, no initial estimate needed to be
chosen for those elements.
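
The mixture part of this initialization can be sketched in Python as follows. The sketch assumes the initial residual sequence has already been obtained from a least-squares AR fit (not shown), and the function name `initial_mixture_estimate` and its arguments are hypothetical.

```python
import random

def initial_mixture_estimate(residuals, num_classes, rng=random):
    """Draw initial mixture parameters from the range of the initial residuals."""
    lo, hi = min(residuals), max(residuals)
    # Means: uniform over the interval spanned by the residuals.
    means = [rng.uniform(lo, hi) for _ in range(num_classes)]
    # Standard deviations: uniform between zero and the width of that interval.
    stds = [rng.uniform(0.0, hi - lo) for _ in range(num_classes)]
    # Weights: all classes equally likely.
    weights = [1.0 / num_classes] * num_classes
    return weights, means, stds
```

Because the standard deviations are drawn from an interval that includes zero, an implementation might reasonably reject draws that are extremely small; that safeguard is a design choice and is not part of the procedure described above.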

2.3.1  Example  1:  Comparison  with  Previous  Work  (Part  I) 

We  begin  with  a  simple  example  for  which  numerical  results  have  already  been  reported  by 
Sengupta  and  Kay  [171].  For  direct  comparison  of  the  performance  of  our  EMAX  algorithm 
to  that  of  the  Sengupta-Kay  (S-K)  algorithm,  we  have  replicated  the  computer  simulations 
carried  out  in  their  previous  work.  The  problem  considered  by  those  authors  was  the  ML 
estimation of the parameters of a fourth-order AR process whose AR coefficients are given by

$$(a_1, a_2, a_3, a_4) = (1.352, -1.338, 0.662, -0.240). \tag{2.48}$$

The driving noise for this process was assumed to consist of i.i.d. samples distributed
according to the two-component Gaussian-mixture pdf

$$f_W(w) = \rho_1\,\mathcal{N}(w;\mu_1,\sigma_1) + \rho_2\,\mathcal{N}(w;\mu_2,\sigma_2), \qquad -\infty < w < \infty, \tag{2.49}$$

where the mixture parameters $\rho_1$, $\mu_1$, $\sigma_1$, $\rho_2$, $\mu_2$, and $\sigma_2$ are defined by

$$(\rho_1, \mu_1, \sigma_1) = (0.9, 0.0, 1.0); \tag{2.50}$$

$$(\rho_2, \mu_2, \sigma_2) = (0.1, 0.0, 10.0). \tag{2.51}$$

A plot of the power spectral density of this process is shown in Figure 2-2.

Figure 2-2: Power spectral density of fourth-order AR process discussed in Example 1.

Sengupta and Kay assumed that the values of $\mu_1$, $\mu_2$, $\sigma_1$, and $\sigma_2$ were known, and
that the values of the remaining model parameters $a_1$, $a_2$, $a_3$, $a_4$, and $\rho_1$ (and, of course,
$\rho_2$, since $\rho_2 = 1 - \rho_1$) were unknown. They developed a Newton-Raphson algorithm for
obtaining ML estimates of the AR parameters and of the overall variance $\sigma^2$ associated with
the driving process, which is given by

$$\sigma^2 = \rho_1\sigma_1^2 + (1-\rho_1)\sigma_2^2. \tag{2.52}$$

Obtaining an ML estimate of $\sigma^2$ is, in this case, equivalent to obtaining an unconstrained
ML estimate of $\rho_1$. This is true because the parameters $\sigma^2$ and $\rho_1$ stand in one-to-one
correspondence, and the ML estimation procedure is invariant with respect to such invertible
transformations on the parameters of the log-likelihood function [149].
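
As a quick check of this correspondence, substituting the mixture parameters of (2.50) and (2.51) into (2.52) gives the true overall variance, and solving (2.52) for $\rho_1$ makes the inverse mapping explicit (it is well defined because $\sigma_1 \neq \sigma_2$):

$$\sigma^2 = (0.9)(1.0)^2 + (0.1)(10.0)^2 = 10.9, \qquad \rho_1 = \frac{\sigma_2^2 - \sigma^2}{\sigma_2^2 - \sigma_1^2} = \frac{100 - 10.9}{100 - 1} = 0.9.$$

Note that an unconstrained estimate of $\sigma^2$ lying outside the interval $[\sigma_1^2, \sigma_2^2] = [1, 100]$ would map to a value of $\rho_1$ outside $[0, 1]$.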

As was done in [171], we performed a total of 5000 trials. On each trial, a sequence
of 1000 data points was generated and processed using the EMAX algorithm. The sample
means and variances of the parameter estimates produced by the EMAX algorithm are
presented in Table 2.1 in the columns labeled EMAX-KSD (where KSD stands for known
standard deviations). The results of a separate simulation in which the standard deviations
were assumed to be unknown are also listed in Table 2.1 in the columns labeled EMAX-USD
(where USD stands for unknown standard deviations). Remarkably, the sample variance of
the estimates of the AR coefficients increased negligibly for the case in which the standard
deviations were unknown. However, the sample variance of the estimate of the remaining
parameter $\sigma^2$ increased dramatically over that for the case in which the standard deviations
were known.

Parameter  True     Sample Mean  Sample Mean  Sample Mean  Sample Variance  Sample Variance  Sample Variance  Cramer-Rao
           Value    (S-K)        (EMAX-KSD)   (EMAX-USD)   (S-K)            (EMAX-KSD)       (EMAX-USD)       Bound (KSD)
---------  -------  -----------  -----------  -----------  ---------------  ---------------  ---------------  ---------------
a_1         1.352    1.3527       1.3518       1.3518      1.0219 x 10^-4   1.0727 x 10^-4   1.0782 x 10^-4   1.0491 x 10^-4
a_2        -1.338   -1.3391      -1.3378      -1.3377      2.4619 x 10^-4   2.5955 x 10^-4   2.6073 x 10^-4   2.5961 x 10^-4
a_3         0.662    0.6629       0.6619       0.6619      2.4253 x 10^-4   2.6125 x 10^-4   2.6225 x 10^-4   2.5961 x 10^-4
a_4        -0.240   -0.2404      -0.2402      -0.2402      1.0352 x 10^-4   1.0742 x 10^-4   1.0753 x 10^-4   1.0491 x 10^-4
sigma^2    10.900   10.8544      10.8963      10.8946      1.2061           1.1655           2.8941           0.3149

Table 2.1: Sample means and variances for parameter estimates from Example 1. Entries
were computed using results of 5000 trials for (i) the algorithm of Sengupta and Kay (S-K),
(ii) the EMAX algorithm with known standard deviations (EMAX-KSD), and (iii) the
EMAX algorithm with unknown standard deviations (EMAX-USD). Cramer-Rao bounds
on the estimation variances, as reported by Sengupta and Kay, are also listed for the case
of known standard deviations (KSD).


We observe from Table 2.1 that the estimate of $\sigma^2$ produced by the EMAX-KSD
algorithm has less bias and a smaller sample variance than the corresponding estimate produced
by the S-K algorithm. A possible explanation for this discrepancy is that Sengupta and
Kay did not constrain their estimate of $\rho_1$ (which is a function of $\sigma^2$), whereas the EMAX
algorithm appropriately constrains its estimate of $\rho_1$ to be between 0 and 1. We make two
further observations from Table 2.1: (i) all of the sample means associated with the AR
parameter estimates generated by the S-K algorithm exhibit slightly more bias than the
sample means generated by the EMAX algorithm; and (ii) all of the sample variances of
these same estimates generated by the S-K algorithm are below the Cramer-Rao bound,
whereas only one of the sample variances generated by the EMAX algorithm has this
property. These discrepancies may stem from the methodology used by Sengupta and Kay. They
report  that  in  approximately  one  percent  of  the  trials  performed  for  this  experiment  (i.e.,  in 
approximately  50  out  of  5000  trials),  their  Newton-Raphson  optimization  algorithm  did  not 
converge.  Whenever  convergence  was  not  obtained,  the  results  of  the  corresponding  trial 
were  discarded;  hence,  these  trials  are  not  reflected  in  the  statistics  presented  in  Table  2.1. 
In  contrast,  the  EMAX  algorithm  converged  in  all  of  the  5000  trials;  hence,  the  results 
of  all  trials  are  represented  in  the  table.  The  reduction  in  variance  realized  by  the  S-K 
algorithm  over  the  EMAX  algorithm  may  be  due  to  the  discarded  trials.  This  conjecture 
is  plausible  if,  on  those  occasions  when  the  Newton-Raphson  algorithm  did  not  converge, 
the  ML  parameter  estimates  were  relatively  far  from  the  true  parameter  values.  If  such  a 
correlation  exists  between  events,  then  it  is  precisely  the  estimates  that  are  never  obtained 
because  of  lack  of  convergence  that  distort  the  sample  variances  reported  by  Sengupta  and 
Kay. 
2.3.2 Example 2: Comparison with Previous Work (Part II)

Our  next  example  illustrates  that  the  EMAX  algorithm  performs  significantly  better  in 
certain  kinds  of  estimation  problems  than  the  algorithm  previously  proposed  by  Zhao  et 
al.  [232],  which  is  based  on  precisely  the  same  statistical  model  for  the  observed  data  as  that 
presented  in  Section  2.1.1.  The  algorithm  of  Zhao,  which  is  apparently  not  motivated  in  any 
respect  by  the  EM  principle,  is  similar  in  structure  to  the  EMAX  algorithm.  In  particular, 
both  of  these  iterative  algorithms  use  the  same  set  of  generalized  normal  equations  to  solve 
for  the  estimates  of  the  AR  parameters  when  given  the  values  of  the  mixture  parameters.  In 
addition,  at  the  beginning  of  each  iteration,  both  algorithms  use  the  resulting  AR  parameter 
estimates  to  inverse  filter  the  observation  sequence.  The  main  difference  lies  in  the  stage  of 
each  algorithm  that  estimates  the  pdf  mixture  parameters  from  the  sequence  of  residuals. 
As  discussed  in  Section  2.2,  the  EMAX  algorithm  uses  the  information  available  in  the 
residual  sequence  to  climb  the  likelihood  surface.  In  contrast,  Zhao  abandons  a  likelihood- 
based  approach  (citing  a  desire  to  avoid  the  degenerate  solutions  mentioned  earlier)  in  favor 
of  a  heuristic  clustering  algorithm. 

In  the  two-component  mixture  case,  the  clustering  algorithm  first  sorts  the  residual 
samples  in  ascending  order  and  then  seeks  out  the  best  point  at  which  to  divide  these  sorted 
samples  into  two  disjoint  sets.  The  optimum  point  is  defined  as  that  which  minimizes  the 
average  value  of  the  sample  variances  associated  with  these  two  sets.  Once  this  optimum 
point  is  found,  Zhao’s  estimates  of  the  means  and  variances  of  the  constituent  Gaussian 
densities  are  the  sample  means  and  sample  variances  associated  with  the  two  sets,  and  the 
estimate  of  the  unknown  weighting  coefficient  is  simply  the  fraction  of  samples  contained 
in  each  set  with  respect  to  the  total  number  of  residual  samples. 
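
The two-component clustering step described above can be sketched in Python as follows. This is our reading of the description rather than Zhao's implementation: the function names are hypothetical, the "average value of the sample variances" is taken to be the unweighted average of the two set variances, and the `min_size` guard (which prevents a single sample from forming a zero-variance set) is an added assumption.

```python
def sample_mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def two_class_split(residuals, min_size=2):
    """Sort the residuals and split them where the average set variance is smallest."""
    ordered = sorted(residuals)
    n = len(ordered)
    best_cost, best_k = None, None
    for k in range(min_size, n - min_size + 1):
        cost = 0.5 * (sample_variance(ordered[:k]) + sample_variance(ordered[k:]))
        if best_cost is None or cost < best_cost:
            best_cost, best_k = cost, k
    low, high = ordered[:best_k], ordered[best_k:]
    return {
        "weights": (len(low) / n, len(high) / n),
        "means": (sample_mean(low), sample_mean(high)),
        "variances": (sample_variance(low), sample_variance(high)),
    }
```

Because the two sets occupy disjoint portions of the sorted data, the two estimated means returned by this procedure can never coincide (unless all residuals are identical), regardless of the true mixture.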

We have observed that the algorithm of Zhao does not perform well when the constituent
Gaussian densities in the driving-noise pdf have equal means. In this example we
demonstrate that in such a case the performance of the EMAX algorithm is markedly superior
to that of the Zhao algorithm. In particular, we considered the problem of estimating the
parameters of an ARGMIX process whose AR coefficients are given by

$$(a_1, a_2, a_3, a_4) = (-0.1000, -0.2238, -0.0844, -0.0294). \tag{2.53}$$

The pdf for the driving noise in this case was assumed to be a two-component
Gaussian-mixture pdf as in (2.49), but now with mixture parameters defined by

$$(\rho_1, \mu_1, \sigma_1) = (0.6, 0.0, 1.0); \tag{2.54}$$

$$(\rho_2, \mu_2, \sigma_2) = (0.4, 0.0, 10.0). \tag{2.55}$$


To compare the performance of the two algorithms, we performed a total of 500 trials.
On each trial, a sequence of 1000 data points was generated and processed with the EMAX
algorithm and Zhao algorithm. The sample means, variances, and mean square errors of the
parameter estimates produced by the two algorithms are presented in Table 2.2. We note
that the Zhao algorithm produces strongly biased estimates in this example. In addition, we
note that the mean square errors associated with the EMAX algorithm are approximately
25 times smaller than those associated with the Zhao algorithm. Clearly, contributions to
the mean square error for Zhao's estimates come not only from the bias term, but also from
the high variance associated with her estimator.

Figure 2-3: Power spectral density of fourth-order AR process discussed in Example 2.

Parameter  True      Sample Mean  Sample Mean  Sample Variance  Sample Variance  Sample MSE     Sample MSE     Ratio
           Value     (EMAX)       (Zhao)       (EMAX)           (Zhao)           (EMAX)         (Zhao)         of MSEs
---------  --------  -----------  -----------  ---------------  ---------------  -------------  -------------  -------
a_1        -0.1000   -0.1000      -0.1150      4.975 x 10^-5    1.170 x 10^-3    4.965 x 10^-5  1.392 x 10^-3  28.03
a_2        -0.2238   -0.2238      -0.2390      5.564 x 10^-5    1.198 x 10^-3    5.553 x 10^-5  1.427 x 10^-3  25.71
a_3        -0.0844   -0.0843      -0.0983      5.539 x 10^-5    1.165 x 10^-3    5.528 x 10^-5  1.355 x 10^-3  24.52
a_4        -0.0294   -0.0289      -0.0405      5.010 x 10^-5    1.148 x 10^-3    5.029 x 10^-5  1.269 x 10^-3  25.24

Table 2.2: Sample means, variances, and mean square error (MSE) values for parameter
estimates of Example 2. Entries were computed using results of 500 trials for (i) the Zhao
algorithm and (ii) the EMAX algorithm. Ratios of sample MSE values (MSE of Zhao to
MSE of EMAX) are also given.
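
The relative sizes of these two contributions can be checked with the usual decomposition of mean square error into variance plus squared bias. Taking the Zhao estimate of $a_1$ from Table 2.2 as an example:

$$\text{MSE} \approx \text{variance} + \text{bias}^2 = 1.170\times10^{-3} + (-0.1150 + 0.1000)^2 = 1.170\times10^{-3} + 0.225\times10^{-3} = 1.395\times10^{-3}.$$

This agrees with the tabulated value of $1.392\times10^{-3}$ to within what we would expect from rounding of the tabulated entries and from the normalization used in computing the sample variance (the latter being an assumption on our part). The squared bias thus accounts for roughly 16 percent of the Zhao mean square error for this coefficient, with the remainder due to variance.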

The difficulties with the Zhao algorithm in this case may be explained by its inability to
obtain good mixture parameter estimates. The quality of the mixture parameter estimates
is inherently limited because the clustering algorithm essentially assigns the individual
densities in the Gaussian mixture to be representatives of disjoint portions of the histogram
of the residual sequence. Thus, one of the most readily observable problems with the
approach, as illustrated in Figure 2-4(a), is that all of the estimated means of the constituent
densities are necessarily distinct, even when the means of the true densities are identical.
Figure 2-4(a) shows the true marginal pdf for the driving noise as well as typical estimates
of this pdf produced by the Zhao algorithm on separate trials. Observe from the figure that,
for about half of the trials, the pdf estimate produced by the Zhao algorithm is off-center
to the positive side of zero, and for the other half it is off-center to the negative side. On
each trial, the estimated Gaussian-mixture pdf is dominated by a single component, which
attempts to model most of the histogram of the residual samples. However, the resulting
overall estimate is always off-center because the smaller of the two components in the
mixture attempts to model the remaining outliers, which are either much greater or much
less than zero. In contrast, as shown in Figure 2-4(b), the EMAX algorithm produces pdf
estimates that better approximate the true driving-noise pdf.

2.3.3  Example  3:  Autoregressive  Process  with  Laplacian  Drive 

In  many  applications,  we  would  like  to  obtain  ML  estimates  for  the  parameters  of  an 
AR  system,  but  the  ML  problem  is  ill-posed  because  the  marginal  pdf  characterizing  the 
driving  noise  is  unknown.  In  certain  cases,  however,  it  may  be  reasonable  to  assume  that 
the  true  marginal  pdf  is  accurately  modeled  by  a  Gaussian-mixture  pdf,  provided  that 
the  means,  standard  deviations,  and  weighting  coefficients  defining  the  mixture  are  chosen 
appropriately.  In  these  cases,  if  we  process  our  observations  with  the  EMAX  algorithm, 
then  we  might  expect  the  EMAX  algorithm  to  find  the  mixture  parameters  that  yield 
a  good  approximation  to  the  true  driving-noise  pdf  and  simultaneously  to  produce  good 
approximations  to  the  ML  estimates  for  the  AR  parameters.  With  the  present  example  we 
demonstrate  the  validity  of  this  approach  to  the  ML  estimation  problem. 

In  particular,  we  consider  the  parameter  estimation  problem  for  a  fifth-order  AR  process 
whose  AR  coefficients  are  given  by 

$$(a_1, a_2, a_3, a_4, a_5) = (1.934, -2.048, 1.072, -0.340, 0.027). \tag{2.56}$$

The driving noise for this process consists of i.i.d. samples distributed according to a
Laplacian pdf defined by

$$f_W(w) = \frac{1}{2\beta}\exp\left\{-\frac{|w|}{\beta}\right\}, \qquad -\infty < w < \infty, \tag{2.57}$$
Figure 2-4: True marginal pdf (dashed curve) for driving process of Example 2 and typical
estimates  of  the  pdf  (solid  curves)  produced  by  (a)  the  algorithm  of  Zhao  et  al.  (20  estimates 
overlaid),  and  (b)  the  EMAX  algorithm  (20  estimates  overlaid). 


where the scale parameter $\beta$ (which is related to the standard deviation $\sigma$ for this density
by $\sigma = \sqrt{2}\,\beta$) was set to $\beta = 5$. A plot of the power spectral density of this process is
shown in Figure 2-5.

It is interesting to compare the performance of the EMAX algorithm to that of the exact
ML estimates, which can be computed in this case. It can be shown [49] that if the samples
of the driving noise for an AR process are i.i.d. and Laplacian, then the ML estimate for
the AR parameter vector $\mathbf{a}$ is given by the value of $\mathbf{a}'$ that minimizes the sum of absolute
residuals $\sum_{t=0}^{N-1} |y_t - \mathbf{y}_t^T\mathbf{a}'|$. An algorithm for finding such a value for $\mathbf{a}'$ was proposed
by Schlossmacher [167]; this algorithm is based on the method of iteratively reweighted least
squares and is therefore easy to implement on a computer.
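
To illustrate the idea of iteratively reweighted least squares for this criterion, the following Python sketch treats the simplest case of a first-order AR model. It is a simplified illustration rather than a reproduction of Schlossmacher's algorithm: the function name, the iteration count, and the small constant `eps` (which prevents division by zero when a residual vanishes) are assumptions.

```python
def lad_ar1_irls(y, num_iters=50, eps=1e-8):
    """Estimate a in y[t] = a*y[t-1] + w[t] by minimizing the sum of |residuals|."""
    pairs = list(zip(y[1:], y[:-1]))          # (current sample, previous sample)
    # Start from the ordinary least-squares solution.
    a = sum(c * p for c, p in pairs) / sum(p * p for _, p in pairs)
    for _ in range(num_iters):
        # Weighting each squared residual by 1/|residual| makes the weighted
        # least-squares cost approximate the sum of absolute residuals.
        weights = [1.0 / max(abs(c - a * p), eps) for c, p in pairs]
        num = sum(w * c * p for w, (c, p) in zip(weights, pairs))
        den = sum(w * p * p for w, (c, p) in zip(weights, pairs))
        a = num / den
    return a
```

For a model of order greater than one, each iteration would instead solve a weighted set of normal equations for the full coefficient vector.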

To find parameter estimates for this problem with the EMAX algorithm, we fixed the
number of Gaussian densities in the mixture at $M = 3$ and constrained the means of these
constituent densities to be zero. We performed a total of 500 trials. On each trial, a
sequence  of  1000  data  points  was  generated  and  processed  with  the  EMAX  algorithm.  The 
sample  means  and  sample  mean  square  errors  of  the  parameter  estimates  produced  by  the 
EMAX  algorithm  are  presented  in  Table  2.3. 

Also shown in Table 2.3 is a summary of the sample means and sample mean square
errors of the AR parameter estimates given by two other algorithms: (i) the
forward-backward least-squares method, and (ii) the ML algorithm of Schlossmacher. Experimental
results shown in Table 2.3 confirm our expectation that the ML-based estimator would
perform better than the EMAX and least-squares methods since it directly exploits the fact
that the driving noise is i.i.d. with a Laplacian distribution. Observe from the table that
the ratio of the mean square error of the least-squares estimate to that of the ML estimate
ranges approximately from 1.9 to 2.1. The ratio of the mean square error of the EMAX
estimate to that of the ML estimate ranges approximately from 1.1 to 1.2. Thus, in this case
the EMAX algorithm produces estimates that are much closer to the exact ML estimates
than the least-squares estimates are.

Figure 2-5: Power spectral density of fifth-order AR process discussed in Example 3.

Parameter  True     Sample Mean  Sample Mean  Sample Mean  Sample MSE       Sample MSE       Sample MSE
           Value    (LS)         (EMAX)       (ML)         (LS)             (EMAX)           (ML)
---------  -------  -----------  -----------  -----------  ---------------  ---------------  ---------------
a_1         1.934    1.9311       1.9323       1.9328      1.0751 x 10^-3   6.3040 x 10^-4   5.7711 x 10^-4
a_2        -2.048   -2.0413      -2.0447      -2.0449      5.1570 x 10^-3   2.8784 x 10^-3   2.7250 x 10^-3
a_3         1.072    1.0647       1.0685       1.0697      8.4001 x 10^-3   4.7769 x 10^-3   4.2887 x 10^-3
a_4        -0.340   -0.3358      -0.3383      -0.3390      4.9714 x 10^-3   2.9190 x 10^-3   2.4228 x 10^-3
a_5         0.027    0.0256       0.0264       0.0269      1.0509 x 10^-3   6.2875 x 10^-4   5.3948 x 10^-4

Table 2.3: Sample means and sample mean square error (MSE) values for parameter
estimates of Example 3. Entries were computed using results of 500 trials for (i) the standard
forward-backward least-squares (LS) method, (ii) the EMAX algorithm, and (iii) the ML
estimation algorithm developed by Schlossmacher.

Figure 2-6: True Laplacian marginal pdf (dashed curve) for driving process of Example 3
and a typical estimate of the pdf (solid curve) produced by the EMAX algorithm, plotted
using (a) linear-magnitude scale (with horizontal axis spanning ±3 standard deviations),
and (b) log-magnitude scale (with horizontal axis spanning ±15 standard deviations).

The superior performance of the EMAX algorithm may be attributed to the ability of
its assumed Gaussian-mixture pdf to closely approximate the Laplacian pdf, as is shown for
a typical case in Figure 2-6(a). It is clear from this figure that the approximation is very
good over the region in which most of the samples of the driving noise reside. However,
since the number of Gaussian densities in the mixture is finite, an accurate model for the
Laplacian density may be obtained only over a finite region of support. Eventually, the tails
of the Gaussian-mixture pdf become bounded by a function of the form $k_1\exp\{-k_2 w^2\}$ for
appropriately chosen constants $k_1$ and $k_2$. Indeed, Figure 2-6(b) reveals this phenomenon
with the aid of a log-magnitude scale.
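
One explicit choice of these constants follows directly from the form of the mixture when all of the component means are zero, as they are in this example. Writing $\sigma_{\min}$ and $\sigma_{\max}$ for the smallest and largest component standard deviations (notation introduced here only for this illustration) and using the fact that the weighting coefficients sum to one, we have for every $w$

$$\sum_{j=1}^{M}\rho_j\,\mathcal{N}(w;0,\sigma_j) = \sum_{j=1}^{M}\frac{\rho_j}{\sqrt{2\pi}\,\sigma_j}\exp\left\{-\frac{w^2}{2\sigma_j^2}\right\} \le \frac{1}{\sqrt{2\pi}\,\sigma_{\min}}\exp\left\{-\frac{w^2}{2\sigma_{\max}^2}\right\},$$

so that $k_1 = 1/(\sqrt{2\pi}\,\sigma_{\min})$ and $k_2 = 1/(2\sigma_{\max}^2)$ suffice. Since a Laplacian density decays only exponentially in $|w|$, the ratio of the mixture pdf to the Laplacian pdf must tend to zero as $|w|$ grows, no matter how the finite set of mixture parameters is chosen.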
2.3.4 Example 4: Blind Equalization in Digital Communications

Our final example is an application in digital communications that has been adapted
from [149]. In this example, we demonstrate that the EMAX algorithm can be used
successfully in problems where the primary goal is signal reconstruction, rather than parameter
estimation. In particular, we consider a communication system that uses amplitude-shift
keying (ASK). In this scheme, the transmitter communicates with the receiver using an
L-symbol alphabet $\mathcal{A} = \{A_i\}_{i=1}^{L}$, whose elements we take to be real numbers. To send the
kth symbol of a particular message sequence $\{u_t\}$ to the receiver, the transmitter generates
a pulse (having fixed shape) and modulates this pulse with the amplitude $u_k$. The pulse
then propagates through the communication medium, which we assume is well modeled by
an LTI system. Finally, the receiver processes the waveform with a linear filter to facilitate
estimation of $u_k$.

If this filtered waveform is sampled at a rate of one sample per symbol, then the overall
communication system (i.e., the transmitter, the medium, and the receiver) can be
represented with an equivalent discrete-time LTI system, which we refer to as the discrete-time
channel. In this case, the sampled output is the convolution of the transmitted symbol
sequence $\{u_t\}$ and the impulse response $\{h_t\}$ that characterizes the discrete-time channel.
If the impulse response $\{h_t\}$ is anything but a shifted and scaled unit impulse, then each
sample of the output sequence will contain contributions from more than one input
symbol, i.e., there will be intersymbol interference (ISI). If the characteristics of the medium
are known, then the discrete-time channel is also known and the receiver can compensate
for the ISI via linear equalization. Often, however, the characteristics of the medium are
unknown, and the impulse response of the discrete-time channel must first be estimated in
order to compensate for the ISI. One approach for accomplishing this is for the
transmitter to send through the medium a training sequence that is known to the receiver. The
receiver can then identify the impulse response of the discrete-time channel from the
output sequence and apply the corresponding inverse filter. However, if the medium is rapidly
changing, then this procedure must be performed frequently, and the effective data rate will
be substantially reduced. An alternative approach is to perform blind equalization, i.e., to
estimate the impulse response of the discrete-time channel from the output without knowing
the input, and then apply the appropriate inverse filter.

We consider a scenario in which blind equalization must be performed by the
receiver. We assume an ASK modulation scheme that uses the four-symbol alphabet
$\mathcal{A} = \{-3, -1, 1, 3\}$. A typical 200-point input sequence to the discrete-time channel, which was
generated randomly using the alphabet $\mathcal{A}$, is shown in Figure 2-7(a). We assume that the
discrete-time channel has a finite impulse response $\{h_t\}$ with z-transform

$$H(z) = 1.0 - 0.65z^{-1} + 0.06z^{-2} + 0.41z^{-3}. \tag{2.58}$$

Figure 2-7(b) shows the received sequence, which is the convolution of the input sequence
shown in Figure 2-7(a) and the impulse response $\{h_t\}$. It is evident from this figure that
detection of the input symbols from the received sequence would be difficult without further
processing.
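
The effect of the channel (2.58) on a random four-level symbol sequence can be reproduced with a few lines of Python. This is only an illustrative sketch: the seed is arbitrary, so the sequence generated here is not the one plotted in Figure 2-7(a), and the helper name `convolve` is hypothetical.

```python
import random

def convolve(u, h):
    """Full linear convolution of the sequences u and h."""
    out = [0.0] * (len(u) + len(h) - 1)
    for i, ui in enumerate(u):
        for k, hk in enumerate(h):
            out[i + k] += ui * hk
    return out

alphabet = [-3, -1, 1, 3]
h = [1.0, -0.65, 0.06, 0.41]      # coefficients of H(z) in (2.58)
rng = random.Random(0)            # arbitrary seed, for repeatability only
u = [rng.choice(alphabet) for _ in range(200)]
y = convolve(u, h)

# Fraction of received samples that still lie within 0.1 of a valid symbol value.
near = sum(1 for v in y[:200] if min(abs(v - s) for s in alphabet) < 0.1)
fraction_near_symbol = near / 200
```

Each received sample is a weighted sum of the current symbol and the three preceding symbols, so the received values are smeared across the gaps between the four symbol levels rather than clustered at them.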
Our blind equalization approach consists of channel estimation followed by filtering with
the inverse of the estimated channel. We compare three methods for estimating the
impulse response of the channel from the output sequence shown in Figure 2-7(b): (i) the
forward-backward least-squares method, (ii) the fourth-order cumulant-based technique of
Giannakis and Mendel [65], and (iii) the EMAX algorithm. We configured all three
algorithms to estimate 18 AR coefficients. Such a configuration assumes that the discrete-time
channel inverse may be accurately modeled with a system having 18 zeros and no poles.
We further configured the EMAX algorithm to estimate the means and variances of four
constituent Gaussian densities. Figures 2-7(c)-(e) show the restored input sequences
generated, respectively, by (i) the least-squares method, (ii) the Giannakis-Mendel algorithm,
and (iii) the EMAX algorithm. It is clear from Figures 2-7(c)-(e) that the recovered
sequence values produced by the EMAX algorithm are much more tightly distributed around
the four true symbol values than either the recovered sequence values produced by the
least-squares method or those produced by the cumulant-based method. Hence, in this case we
would expect superior detection performance using the EMAX algorithm.
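
To give a sense of what an 18-coefficient inverse model can achieve, the following Python sketch computes the first 18 coefficients of the power-series expansion of $1/H(z)$ for the channel (2.58) and applies the resulting all-zero filter to a channel output. It is not a blind method: the channel is taken as known here, so the sketch only indicates the kind of inverse filter that the estimation algorithms are attempting to identify. All function names are hypothetical.

```python
def truncated_inverse(h, order):
    """Power-series coefficients g[0..order] of 1/H(z), assuming h[0] == 1."""
    g = [1.0]
    for n in range(1, order + 1):
        g.append(-sum(h[k] * g[n - k] for k in range(1, min(n, len(h) - 1) + 1)))
    return g

def apply_channel(u, h):
    """Causal FIR filtering: y[t] = sum_k h[k]*u[t-k], with u[t] = 0 for t < 0."""
    return [sum(h[k] * u[t - k] for k in range(len(h)) if t - k >= 0)
            for t in range(len(u))]

def inverse_filter(y, a):
    """Apply 1 - sum_k a[k-1] z^-k to y, taking samples before time 0 as zero."""
    out = []
    for t in range(len(y)):
        pred = sum(a[k - 1] * y[t - k] for k in range(1, len(a) + 1) if t - k >= 0)
        out.append(y[t] - pred)
    return out

h = [1.0, -0.65, 0.06, 0.41]          # coefficients of H(z) in (2.58)
g = truncated_inverse(h, 18)
a = [-gk for gk in g[1:]]             # AR coefficients of the 18th-order inverse model

u = [3, -1, 1, -3, 3, 1, -1, -3] * 5  # illustrative symbol sequence
u_hat = inverse_filter(apply_channel(u, h), a)
```

Because the expansion is truncated, a small amount of residual intersymbol interference remains in `u_hat` once more than 18 symbols have been sent; for this channel the residual is small compared with the spacing between symbol values.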


2.4  An  Alternative  Version  of  the  EMAX  Algorithm 

In  its  original  form,  the  EMAX  algorithm  is  capable  of  solving  an  extremely  broad  class  of 
source  identification  problems  because  the  ARGMIX  model  on  which  it  is  based  offers  many 
degrees  of  freedom  in  signal  representation.  As  we  have  seen,  the  ARGMIX  model  consists 
of  two  main  components  for  describing  an  unknown  signal:  (i)  a  pdf  that  characterizes  the 
statistical  behavior  of  each  sample  of  the  driving  noise;  and  (ii)  an  autoregressive  linear 
time-invariant  system  that  induces  temporal  dependency  among  the  samples  of  the  driving 
noise.  The  power  of  the  EMAX  algorithm  over  conventional  least-squares  techniques  clearly 
derives  from  the  first  of  these  components,  i.e.,  from  the  flexibility  of  allowing  the  driving- 
noise  pdf  to  be  unknown  and  to  have  an  arbitrarily  complicated  shape. 

There  are  many  situations  arising  in  practice,  however,  in  which  we  do  not  require  such 
flexibility  in  a  parameter  estimation  algorithm;  in  fact,  in  certain  situations  we  would  gladly 
sacrifice the flexibility of the original algorithm in exchange for a reduction in its
computational complexity. In this section, we consider a useful restriction of the original
estimation problem that affords such a trade-off. In particular, we focus on the case in which
the driving noise is characterized by a fixed pdf whose functional form is precisely known except for
a  scale  factor  (i.e.,  a  positive  real  number  that  indicates  the  degree  of  dispersion  in  the  data 
distribution).  In  such  a  case,  even  though  much  prior  information  about  the  pdf  is  available, 
implementing  an  exact  ML  procedure  directly  may  still  be  extremely  difficult  because  of 
the  complicated  mathematical  form  of  the  pdf.  On  the  other  hand,  using  too  simple  an 
approximation  to  the  exact  ML  procedure  (for  example,  using  the  classical  approximation 
based  on  the  Gaussian-noise  assumption)  may  lead  to  unacceptably  poor  results.  For  a 
situation  of  this  kind,  the  EMAX  algorithm  can  easily  be  reconfigured  so  that  it  provides 
a  sufficiently  sophisticated,  yet  also  quite  efficient  and  convenient,  method  of  obtaining  a 
good  approximation  to  the  ML  solution. 

Figure 2-7: Illustration of channel equalization considered in Example 4: (a) original symbol
sequence; (b) received sequence; (c) restored sequence using standard forward-backward
least-squares method; (d) restored sequence using fourth-order cumulant-based Giannakis-
Mendel algorithm; (e) restored sequence using EMAX algorithm assuming four-component
Gaussian-mixture pdf.

To explore this idea further, let us suppose that the true density $f_W(\cdot)$ for the driving
noise belongs to a parameterized family of densities that is invariant with respect to scale
(i.e., if the pdf for the random variable $W$ is in the family, then the pdf for $W/\beta$ is also in
the family for any positive real number $\beta$). For example, the zero-mean Laplacian family
used in an earlier example is scale-invariant, as is the Gaussian family, and hence also
the Gaussian-mixture family for any fixed number $M$ of mixture components. To indicate
explicitly that the driving-noise pdf belongs to a scale-invariant family of densities, we shall
write it as $f_W(\cdot\,; \beta)$ in the remainder of this section, with the understanding that

$$f_W(w; \beta) = f_{W/\beta}(w), \qquad -\infty < w < \infty. \tag{2.59}$$

For convenience, we shall also assume that $f_W(\cdot\,; \beta)$ is continuous at all but a finite number
of points on the real line and contains no impulses.

By allowing the scale factor on an arbitrary but known driving-noise pdf to be a free
parameter, we are essentially creating a generalization of the classical AR parameter
estimation problem in which the zero-mean driving noise is assumed to be Gaussian, but has
an unknown standard deviation that must also be estimated. The generalization follows
immediately from the fact that the quantity $1/\beta$ serves as a measure of the dispersion of
the distribution, since it is related to the standard deviation of the distribution through an
affine transformation. To make this interpretation as direct as possible, let us assume in
the sequel that the driving noise is indeed zero-mean and that the parameter value $\beta = 1$
corresponds to a standard deviation of unity, so that $1/\beta$ is exactly equal to the standard
deviation for all $\beta > 0$.

Because the functional form of the pdf $f_W(\cdot\,; \beta)$ is assumed known for any value of $\beta$,
a good approximation to this pdf for a particular value of $\beta$ (say, $\beta = 1$) can be designed
off-line, using a Gaussian-mixture model, before any data are observed. Such a procedure
yields an approximation of the form

$$f_W(w; \beta)\big|_{\beta=1} \approx \sum_{i=1}^{M} \rho_i\, \mathcal{N}(w; \mu_i, \sigma_i), \tag{2.60}$$


where the number of mixture components $M$ is also chosen as part of the design process.
Once a suitable Gaussian-mixture approximation has been obtained for the case where
$\beta = 1$, it can then be adjusted to approximate any other element in the original family of
densities by performing a simple transformation on the mixture parameters. In particular,
by using the scale-invariance property of the original family of densities, we can write

$$f_W(w; \beta) = f_{W/\beta}(w) \tag{2.61}$$

$$= \beta f_W(\beta w; 1) \tag{2.62}$$

$$\approx \sum_{i=1}^{M} \beta \rho_i\, \mathcal{N}(\beta w; \mu_i, \sigma_i) \tag{2.63}$$

$$= \sum_{i=1}^{M} \rho_i\, \mathcal{N}(w; \mu_i/\beta, \sigma_i/\beta), \tag{2.64}$$
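where the last step follows from the fact that, for each Gaussian component $\mathcal{N}(\cdot\,; \mu_i, \sigma_i)$
included in the mixture, we may write

$$\beta\, \mathcal{N}(\beta w; \mu_i, \sigma_i) = \frac{\beta}{\sqrt{2\pi}\,\sigma_i} \exp\left\{ -\frac{(\beta w - \mu_i)^2}{2\sigma_i^2} \right\} \tag{2.65}$$

$$= \frac{1}{\sqrt{2\pi}\,(\sigma_i/\beta)} \exp\left\{ -\frac{[w - (\mu_i/\beta)]^2}{2(\sigma_i/\beta)^2} \right\} \tag{2.66}$$

$$= \mathcal{N}(w; \mu_i/\beta, \sigma_i/\beta). \tag{2.67}$$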


We conclude from the series of equalities (2.61)-(2.64) that an approximation for any pdf in
the original parameterized family can easily be generated from the initial approximation by
appropriately scaling the means $\{\mu_i\}_{i=1}^{M}$ and standard deviations $\{\sigma_i\}_{i=1}^{M}$ of the
Gaussian-mixture components. The weighting coefficients $\{\rho_i\}_{i=1}^{M}$ would remain unchanged
from their initial values.
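
The rescaling rule can be checked numerically. The following is a minimal sketch: the two-component unit-scale mixture and the value of the scale factor are hypothetical, and the Gaussian density is parameterized by its standard deviation, as in the equalities above.

```python
import numpy as np

def gauss(w, mu, sigma):
    """Gaussian pdf N(w; mu, sigma), parameterized by standard deviation."""
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

def mixture_pdf(w, rho, mu, sigma):
    """Evaluate a Gaussian-mixture pdf at each point of w."""
    w = np.asarray(w, dtype=float)[..., None]   # broadcast points against components
    return np.sum(rho * gauss(w, mu, sigma), axis=-1)

def rescale_mixture(rho, mu, sigma, beta):
    """Scale means and standard deviations by 1/beta; weights are unchanged."""
    return rho, mu / beta, sigma / beta

# Hypothetical unit-scale design (beta = 1) and an arbitrary scale factor.
rho = np.array([0.7, 0.3])
mu = np.array([-0.5, 1.2])
sigma = np.array([0.6, 1.5])
beta = 2.5

w = np.linspace(-4.0, 4.0, 201)
lhs = beta * mixture_pdf(beta * w, rho, mu, sigma)             # beta * f(beta w; 1)
rhs = mixture_pdf(w, *rescale_mixture(rho, mu, sigma, beta))   # rescaled mixture
```

The arrays `lhs` and `rhs` agree to floating-point precision, which is the content of the step from (2.63) to (2.64).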


Having made these observations about Gaussian-mixture approximations, let us now
derive a new version of the EMAX algorithm based on the assumption that the true driving-
noise pdf is itself a Gaussian mixture which is given by

$$f_W(w; \beta) = \sum_{i=1}^{M} \rho_i\, \mathcal{N}(w; \mu_i/\beta, \sigma_i/\beta), \tag{2.68}$$

where the values of the parameters $M$, $\{\rho_i\}_{i=1}^{M}$, $\{\mu_i\}_{i=1}^{M}$, and $\{\sigma_i\}_{i=1}^{M}$ are again precisely
known. The goal of this alternative version of the EMAX algorithm will be to generate
joint QML estimates for the AR parameter vector $\mathbf{a}$ and for the scale factor $\beta$ associated
with the driving-noise pdf. To reflect this restriction in the new estimation problem, we
re-define the parameter vector $\Psi$ as

$$\Psi = (\beta, \mathbf{a}). \tag{2.69}$$

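The new algorithm for estimating the value of $\Psi$ is now specified by the iterative formula

$$(\beta^{(s+1)}, \mathbf{a}^{(s+1)}) = \arg\max_{\beta', \mathbf{a}'} E\left\{ \log f_{\mathbf{Y}|\mathbf{\Phi},\mathbf{Y}_0}(\mathbf{y} \mid \mathbf{\Phi}, \mathbf{y}_0; \beta', \mathbf{a}') \;\middle|\; \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \Psi^{(s)} \right\}, \tag{2.70}$$

which is analogous to the original EM formula given in (2.28). (Observe that the complement
of the original formula given in (2.27) is no longer needed, since the weighting coefficients are
now considered fixed and known.) To obtain a more explicit expression of the function being
maximized in (2.70), we can borrow the previously derived formula in (2.38) and replace
the original variables $\mu_j'$ and $\sigma_j'$ with their new counterparts $\mu_j/\beta'$ and $\sigma_j/\beta'$, respectively;
this substitution yields the modified formula

$$E\left\{ \log f_{\mathbf{Y}|\mathbf{\Phi},\mathbf{Y}_0}(\mathbf{y} \mid \mathbf{\Phi}, \mathbf{y}_0; \mathbf{a}', \beta') \;\middle|\; \mathbf{Y} = \mathbf{y}, \mathbf{Y}_0 = \mathbf{y}_0; \Psi^{(s)} \right\} = -\frac{N}{2} \log 2\pi + \sum_{t=0}^{N-1} \sum_{j=1}^{M} P_{t,j}(\Psi^{(s)}) \log \frac{\beta'}{\sigma_j} - \sum_{t=0}^{N-1} \sum_{j=1}^{M} P_{t,j}(\Psi^{(s)}) \frac{[y_t - \mathbf{y}_t^T \mathbf{a}' - (\mu_j/\beta')]^2}{2(\sigma_j/\beta')^2}. \tag{2.71}$$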


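As a reminder that the search for a maximum is to be performed over only the two variables
$\beta'$ and $\mathbf{a}'$, we introduce a new objective function $H(\cdot)$ defined by

$$H(\beta', \mathbf{a}') = \sum_{t=0}^{N-1} \sum_{j=1}^{M} P_{t,j}(\Psi^{(s)}) \log \frac{\beta'}{\sigma_j} - \sum_{t=0}^{N-1} \sum_{j=1}^{M} P_{t,j}(\Psi^{(s)}) \frac{[\beta'(y_t - \mathbf{y}_t^T \mathbf{a}') - \mu_j]^2}{2\sigma_j^2}, \tag{2.72}$$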

which  is  derived  from  the  expression  on  the  right-hand  side  of  (2.71)  by  dropping  the  initial 
constant  term  and  substituting  a  somewhat  more  convenient  (yet  algebraically  equivalent) 
form  for  the  final  term.  By  taking  a  partial  derivative  with  respect  to  the  variable  a'  and 
setting  the  result  equal  to  zero,  we  have  that  the  unique  maximum  occurs  at  the  vector 
location 


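$$\arg\max_{\mathbf{a}'} H(\beta', \mathbf{a}') = \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2}\, \mathbf{y}_t \mathbf{y}_t^T \right]^{-1} \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} \left( y_t - \frac{\mu_j}{\beta'} \right) \mathbf{y}_t \right]. \tag{2.73}$$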


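If we now take a partial derivative of $H(\cdot)$ with respect to the scale variable $\beta'$, we obtain

$$\frac{\partial H}{\partial \beta'} = \frac{1}{\beta'} \sum_{t=0}^{N-1} \sum_{j=1}^{M} P_{t,j}(\Psi^{(s)}) - \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2}\, (y_t - \mathbf{y}_t^T \mathbf{a}')\, [\beta'(y_t - \mathbf{y}_t^T \mathbf{a}') - \mu_j]. \tag{2.74}$$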


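Observe that, for a fixed value of $t$, the $M$ terms $\{P_{t,j}(\Psi^{(s)})\}_{j=1}^{M}$ necessarily sum to unity
because they form a probability mass function; it follows that the value of the double
summation in the first term above must be equal to $N$. If we first make this substitution and
then set the entire expression equal to zero, we obtain (after some algebraic manipulation)
the quadratic equation

$$\left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} (y_t - \mathbf{y}_t^T \mathbf{a}')^2 \right] \beta'^2 - \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})\, \mu_j}{\sigma_j^2} (y_t - \mathbf{y}_t^T \mathbf{a}') \right] \beta' - N = 0. \tag{2.75}$$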


For  notational  convenience,  we  rewrite  this  equation  in  the  more  concise  form 

$$A\beta'^2 - B\beta' - N = 0, \tag{2.76}$$

where the definitions of $A$ and $B$ are readily inferred from (2.75). Next, by applying the
quadratic formula, we obtain the two possible solutions

$$\beta' = \begin{cases} \dfrac{B + \sqrt{B^2 + 4NA}}{2A} \\[2ex] \dfrac{B - \sqrt{B^2 + 4NA}}{2A} \end{cases} \tag{2.77}$$

Chapter 2. Using the ARGMIX Signal Model for Non-Gaussian Inference

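Note that since $A$ is always positive, the expression under the square root sign must also be
positive. Moreover, the square root of this expression always exceeds the magnitude of $B$.
Thus, while we cannot know in advance whether $B$ itself will be positive
or negative, we can say with certainty that the first solution given in (2.77) will always
be positive and that the second will always be negative. Moreover, it can be shown (by
taking a second derivative of $H(\cdot)$ with respect to $\beta'$) that either of these two solutions is
a local maximum of $H(\cdot)$ for a fixed value of $\mathbf{a}'$. Because we seek a positive value of $\beta'$ that
maximizes $H(\cdot)$, we know that the unique solution must be

$$\arg\max_{\beta' > 0} H(\beta', \mathbf{a}') = \frac{\sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})\, \mu_j}{\sigma_j^2} (y_t - \mathbf{y}_t^T \mathbf{a}')}{2 \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} (y_t - \mathbf{y}_t^T \mathbf{a}')^2} + \frac{\sqrt{\left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})\, \mu_j}{\sigma_j^2} (y_t - \mathbf{y}_t^T \mathbf{a}') \right]^2 + 4N \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} (y_t - \mathbf{y}_t^T \mathbf{a}')^2 \right]}}{2 \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} (y_t - \mathbf{y}_t^T \mathbf{a}')^2}. \tag{2.78}$$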
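
The second-derivative argument mentioned above is short enough to write out. As a sketch of the omitted step, using the shorthand $A$ and $B$ from (2.76) and the fact that the posterior probabilities sum to $N$, the derivatives of $H$ with respect to $\beta'$ for fixed $\mathbf{a}'$ are:

```latex
\begin{align*}
\frac{\partial H}{\partial \beta'} &= \frac{N}{\beta'} - A\beta' + B, \\
\frac{\partial^2 H}{\partial \beta'^2} &= -\frac{N}{\beta'^2} - A \;<\; 0
\quad \text{for all } \beta' \neq 0 .
\end{align*}
```

Since the second derivative is negative wherever it is defined, each stationary point is a local maximum, and $H$ is strictly concave in $\beta'$ on $(0, \infty)$, so the positive root is the only maximizer in that range.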


Because  (2.78)  and  (2.73)  are  highly  coupled  nonlinear  equations,  we  once  again  resort  to 
the  technique  of  coordinate  ascent  to  evaluate  the  optimal  solution.  Using  the  equations  of 
optimality,  we  can  express  the  coordinate-ascent  algorithm  as  follows: 


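INITIALIZATION:

$$\tilde{\beta}^{(0)} = \beta^{(s)} \tag{2.79}$$

$$\tilde{\mathbf{a}}^{(0)} = \mathbf{a}^{(s)} \tag{2.80}$$

ITERATION:

$$\tilde{\beta}^{(i+1)} = \frac{\sum_t \sum_j \frac{P_{t,j}(\Psi^{(s)})\, \mu_j}{\sigma_j^2} (y_t - \mathbf{y}_t^T \tilde{\mathbf{a}}^{(i)})}{2 \sum_t \sum_j \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} (y_t - \mathbf{y}_t^T \tilde{\mathbf{a}}^{(i)})^2} + \frac{\sqrt{\left[ \sum_t \sum_j \frac{P_{t,j}(\Psi^{(s)})\, \mu_j}{\sigma_j^2} (y_t - \mathbf{y}_t^T \tilde{\mathbf{a}}^{(i)}) \right]^2 + 4N \sum_t \sum_j \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} (y_t - \mathbf{y}_t^T \tilde{\mathbf{a}}^{(i)})^2}}{2 \sum_t \sum_j \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} (y_t - \mathbf{y}_t^T \tilde{\mathbf{a}}^{(i)})^2} \tag{2.81}$$

$$\tilde{\mathbf{a}}^{(i+1)} = \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2}\, \mathbf{y}_t \mathbf{y}_t^T \right]^{-1} \left[ \sum_{t=0}^{N-1} \sum_{j=1}^{M} \frac{P_{t,j}(\Psi^{(s)})}{\sigma_j^2} \left( y_t - \frac{\mu_j}{\tilde{\beta}^{(i+1)}} \right) \mathbf{y}_t \right] \tag{2.82}$$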

After this recursion has been performed for $i = 0, 1, \ldots, J-1$ (where $J$ is chosen to
be a sufficiently large integer), our parameter updates are then defined by $\mathbf{a}^{(s+1)} = \tilde{\mathbf{a}}^{(J)}$
and $\beta^{(s+1)} = \tilde{\beta}^{(J)}$, and the entire process is subsequently repeated. In summary, then, a
single iteration of this modified EMAX algorithm consists of first computing the posterior
probabilities $\{P_{t,j}(\Psi^{(s)})\}$ and then iterating (2.81) and (2.82) until convergence.
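
The inner recursion can be sketched as follows. This is an illustrative implementation under stated assumptions, not the author's code: the function name, the argument layout (`Y` holds one regressor vector of past samples per row, `P` holds the posterior probabilities already computed in the E-step), and the fixed number of inner passes `n_inner` are all hypothetical choices.

```python
import numpy as np

def mstep_scale_ar(y, Y, P, mu, sigma, a0, beta0, n_inner=50):
    """Coordinate ascent on (beta, a) for fixed posterior probabilities P.

    y     : (N,)   samples y_t
    Y     : (N, K) rows are the regressor vectors y_t (past samples)
    P     : (N, M) posterior probabilities P_{t,j}
    mu    : (M,)   unit-scale component means
    sigma : (M,)   unit-scale component standard deviations
    """
    N = len(y)
    W = P / sigma**2                  # weights P_{t,j} / sigma_j^2
    wt = W.sum(axis=1)                # sum over j of the weights, per time index
    wm = W @ mu                       # sum over j of weight * mu_j, per time index
    G = Y.T @ (wt[:, None] * Y)       # sum_t sum_j weight * y_t y_t^T
    a, beta = np.asarray(a0, dtype=float), float(beta0)
    for _ in range(n_inner):
        r = y - Y @ a                 # prediction residuals for the current a
        A = np.sum(wt * r**2)         # coefficient of beta^2 in the quadratic
        B = np.sum(wm * r)            # coefficient of beta in the quadratic
        beta = (B + np.sqrt(B**2 + 4.0 * N * A)) / (2.0 * A)   # positive root
        rhs = Y.T @ (wt * y - wm / beta)
        a = np.linalg.solve(G, rhs)   # weighted normal equations for a
    return beta, a
```

As a sanity check on the design, with a single zero-mean, unit-variance component the update for `a` reduces to ordinary least squares and `beta` reduces to the reciprocal of the root-mean-square residual, which matches the classical Gaussian case.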

Recall that we must again choose values for $\mathbf{a}^{(0)}$ and $\beta^{(0)}$ to initialize our new algorithm.
A logical choice for $\mathbf{a}^{(0)}$ is, as before, the parameter vector estimate obtained by applying the
Yule-Walker equations to the original observed sequence. Moreover, since the parameter $\beta$
is inversely proportional to the standard deviation of the driving-noise pdf, a logical choice
for $\beta^{(0)}$ is the reciprocal of the sample standard deviation associated with the residual
sequence $w^{(0)}$, which can readily be obtained after applying the approximate inverse filter
$1 - \sum_k a_k^{(0)} z^{-k}$ to the sequence of observations.
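
This initialization can be sketched as below. The helper names are hypothetical, and the use of biased sample autocorrelations is an assumption; the text only states that the Yule-Walker equations are applied and that the residual standard deviation is inverted.

```python
import numpy as np

def yule_walker(y, K):
    """Order-K Yule-Walker estimate from biased sample autocorrelations."""
    y = np.asarray(y, dtype=float)
    N = len(y)
    r = np.array([np.dot(y[: N - k], y[k:]) / N for k in range(K + 1)])
    R = np.array([[r[abs(i - j)] for j in range(K)] for i in range(K)])
    return np.linalg.solve(R, r[1:])

def initial_scale(y, a):
    """Reciprocal of the sample standard deviation of the inverse-filter output."""
    y = np.asarray(y, dtype=float)
    K = len(a)
    # Residual w_t = y_t - sum_k a_k y_{t-k}, computed for t = K, ..., N-1.
    pred = sum(a[k] * y[K - 1 - k : len(y) - 1 - k] for k in range(K))
    w = y[K:] - pred
    return 1.0 / np.std(w)
```

For a long first-order Gaussian AR sequence with unit-variance driving noise, `yule_walker(y, 1)` should land close to the true coefficient and `initial_scale` close to one.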

2.5  ARGMIX  Signal  Estimation  in  Additive  Noise 

Up  to  this  point,  we  have  considered  only  the  problem  of  source  identification  as  it  applies  to 
an unknown ARGMIX process. By developing simple iterative techniques for solving this
problem,  we  have  demonstrated  that  good  parameter  estimates  for  ARGMIX  processes  can 
be  generated  by  searching  for  local  maxima  of  the  likelihood  surface.  In  fact,  the  techniques 
we  developed  in  earlier  sections  constitute  an  important  extension  to  the  existing  collection 
of  classical  methods,  which  were  designed  to  solve  the  less  complicated  source  identification 
problem  in  which  the  unknown  process  is  assumed  to  be  autoregressive  but  purely  Gaussian. 

The  degree  of  success  we  were  able  to  achieve  in  the  source  identification  problem  now 
leads  us  to  question  whether  similar  progress  might  be  made  in  the  equally  important 
signal  estimation  problem.  In  this  section,  we  therefore  turn  our  attention  to  the  problem 
of  optimally  filtering  an  ARGMIX  process  that  has  been  corrupted  by  independent  additive 
noise,  under  the  assumption  that  we  are  given  the  true  parameter  values  for  both  signal 
and  noise.  We  soon  discover,  however,  that  the  ARGMIX  signal  model  is  not  well  suited  to 
the development of filtering or smoothing techniques that are both computationally efficient
and  globally  optimal.  Indeed,  it  appears  that  any  algorithm  designed  to  produce  an  optimal 
estimate  of  an  ARGMIX  signal  in  Gaussian  noise  necessarily  incurs  a  computational  cost 
that  grows  exponentially  as  a  function  of  the  length  of  the  observed  sequence. 

After  demonstrating  the  inherent  algorithm  complexity  associated  with  the  ARGMIX 
signal  model,  we  discuss  several  alternative  solutions  to  the  signal  estimation  problem;  these 
solutions  are  reasonable  approximations  which  are  much  less  computationally  expensive,  but 
which  naturally  lack  the  property  of  global  optimality.  Our  discussion  of  these  alternative 
methods  will  actually  lead  us  to  the  creation  of  a  new  signal  model  which  is  altogether 
different  from  the  original  ARGMIX  signal  model,  but  which  appears  to  be  more  practical 
and more general than the ARGMIX model. It is the analysis and development of this new
model  that  will  occupy  us  for  the  remaining  portion  of  the  thesis. 

To  demonstrate  the  difficulty  we  encounter  when  attempting  to  estimate  an  ARGMIX 
process  that  has  been  corrupted  by  additive  noise,  let  us  now  consider  a  very  simple  yet 
illustrative problem of this kind. In particular, suppose we have a random signal $\{Y_t\}$ that
has been generated according to the first-order AR difference equation

$$Y_t = a Y_{t-1} + W_t, \tag{2.83}$$

where $a$ is the single real-valued AR coefficient of the process and $\{W_t\}$ is a sequence of
i.i.d. random variables, each distributed according to a fixed Gaussian-mixture pdf, which
we denote by $f_W(\cdot)$. For simplicity, we shall assume that this driving-noise pdf $f_W(\cdot)$
consists of only two zero-mean Gaussian components, so that we may write it as

$$f_W(w) = \rho\, \mathcal{N}(w; 0, \sigma_1) + (1 - \rho)\, \mathcal{N}(w; 0, \sigma_2). \tag{2.84}$$

To ensure that the driving noise $\{W_t\}$ is indeed non-Gaussian (i.e., to preclude an assignment
of ARGMIX parameter values that would yield a purely Gaussian signal), we impose the
further conditions $0 < \rho < 1$ and $\sigma_2 \neq \sigma_1$ on the model parameters. We will assume for
convenience, however, that the autoregression in (2.83) is initialized randomly at time $t = 0$
according to a Gaussian probability law, so that the signal variable $Y_0$ is characterized by
the pdf $\mathcal{N}(\cdot\,; 0, \sigma_Y)$.

In  contrast  to  the  setup  for  the  source  identification  problem,  a  clean  observation  of  the 
signal $\{Y_t\}$ is not available in this case. Instead, we may observe the signal only after it
has been corrupted by additive white Gaussian noise; hence, each element of the observed
sequence $\{Z_t\}$ has the form

$$Z_t = Y_t + V_t, \tag{2.85}$$

where $\{V_t\}$ is a sequence of i.i.d. random variables, each having a pdf $f_V(\cdot)$ defined by

$$f_V(v) = \mathcal{N}(v; 0, \sigma_V). \tag{2.86}$$

All random variables contained in the sequences $\{V_t\}$ and $\{W_t\}$ are understood to be
mutually independent.
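
A short simulation makes the measurement model concrete. This is a minimal sketch; the function name and argument order are hypothetical, and the pdf-selection label drawn for time 0 is unused because the initial sample is Gaussian by assumption.

```python
import numpy as np

def simulate_argmix_ar1(N, a, rho, sigma1, sigma2, sigma_y, sigma_v, rng):
    """Draw a signal y, a noisy observation z, and the component labels phi."""
    # Component labels: 1 with probability rho, otherwise 2 (phi[0] is unused).
    phi = np.where(rng.random(N) < rho, 1, 2)
    sig = np.where(phi == 1, sigma1, sigma2)
    y = np.empty(N)
    y[0] = sigma_y * rng.standard_normal()            # Gaussian initialization
    for t in range(1, N):
        y[t] = a * y[t - 1] + sig[t] * rng.standard_normal()
    z = y + sigma_v * rng.standard_normal(N)          # additive white Gaussian noise
    return y, z, phi
```

Setting `sigma_v` to zero returns the clean signal as the observation, which is the setting of the source identification problem treated earlier.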

It should be clear from the description given above that our observation model is
completely characterized by the parameter vector

$$\Psi = (a, \rho, \sigma_1, \sigma_2, \sigma_Y, \sigma_V). \tag{2.87}$$

Suppose, then, that we know the true value of each element of $\Psi$, and furthermore that we
have been furnished with realizations of the first $N$ samples of the sequence $\{Z_t\}$. Then,
given that we have observed the event $\mathbf{Z}_{0:N-1} = \mathbf{z}_{0:N-1}$, our objective is to produce an
MMSE estimate of the underlying signal realization $\mathbf{y}_{0:N-1}$. It is well known that the
desired estimate $\hat{\mathbf{y}}_{0:N-1}$ is simply the conditional mean vector given by

$$\hat{\mathbf{y}}_{0:N-1} = E\{\mathbf{Y}_{0:N-1} \mid \mathbf{Z}_{0:N-1} = \mathbf{z}_{0:N-1}; \Psi\}. \tag{2.88}$$

To  understand  why  the  computation  of  this  optimal  estimate  becomes  so  complex  as  the 
observation  length  N  gets  large,  let  us  now  explore  the  structure  of  the  estimate  for  various 
values  of  N.  Our  approach  will  be  to  build  up  the  solution  in  a  series  of  steps,  starting  with 
the  simplest  case  in  which  N  =  1  and  progressively  working  toward  the  case  in  which  N  is 
allowed  to  be  arbitrarily  large. 

First suppose that $N = 1$. By using our assumption about the initialization of the
autoregressive formula in (2.83), together with the above description of the additive noise,
we know that the observed variable $Z_0$ is constructed as a superposition of the independent
zero-mean Gaussian random variables $Y_0$ and $V_0$, and is therefore itself a zero-mean Gaussian
random variable. Therefore, the value of the optimal estimate in this case is given by the
classical linear formula

$$\hat{y}_0 = \frac{\sigma_Y^2}{\sigma_Y^2 + \sigma_V^2}\, z_0. \tag{2.89}$$


Note  that  the  signal  estimate  has  an  extremely  simple  form  when  N  =  1,  and  that  this 
estimate  can  be  evaluated  by  performing  only  a  single  multiplication,  provided  that  the 
leading scale factor is computed off-line before the realization $z_0$ is received.

Now consider the case in which $N = 2$. In this case, the observed vector $\mathbf{Z}_{0:1}$ no longer
possesses a Gaussian pdf, since one of its two independent components (namely, the signal
vector $\mathbf{Y}_{0:1}$) is not Gaussian. To verify this latter assertion, we need only observe that if
the vector $\mathbf{Y}_{0:1}$ were Gaussian, then the variable $Y_1$, when conditioned on the event $Y_0 = y_0$,
would also be Gaussian. From our model assumptions, however, we know that this cannot
be true, because under such conditioning, $Y_1$ consists of the constant $a y_0$ plus a random
innovation  that  is  distributed  according  to  a  two-component  Gaussian-mixture  probability 
law.  This  does  not  necessarily  imply  that  we  can  no  longer  use  the  classical  techniques  from 
linear-Gaussian  estimation  theory  to  construct  an  optimal  estimate.  On  the  contrary,  we  can 
still  apply  such  techniques  once  we  are  able  to  decompose  the  new  non-Gaussian  estimation 
problem  into  a  collection  of  separate  Gaussian  estimation  problems.  This  point  of  view 
will  allow  us  to  apply  a  conventional  linear  processor  in  each  of  the  individual  Gaussian 
problems  and  then  combine  the  resulting  estimates  using  a  special  type  of  weighted  average. 

The key to decomposing the original non-Gaussian problem into purely Gaussian
components is to condition the original problem on the outcome of the unobservable event
$\Phi_1 = \phi_1$, where the random variable $\Phi_1$ is (as defined in our earlier development) the
pdf-selection variable corresponding to time $t = 1$. Such conditioning allows us to assume that
we know which of the two Gaussian densities in the mixture gave rise to the driving-noise
sample at time $t = 1$. For example, if we know with certainty that $\Phi_1 = 1$, then we may
conclude that the conditional driving-noise pdf must be $\mathcal{N}(\cdot\,; 0, \sigma_1)$. Under this assumption,
it now follows that the variable $Y_1$ is Gaussian when conditioned on the event $Y_0 = y_0$;
hence, the entire signal vector $\mathbf{Y}_{0:1}$ is also Gaussian, as is the observed vector $\mathbf{Z}_{0:1}$.

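Because we must now account for the two possibilities $\Phi_1 = 1$ and $\Phi_1 = 2$, we will
need to further manipulate the expression in (2.88) in order to reduce it to simplest terms.
Specifically, we now express the optimal estimator as

$$\hat{\mathbf{y}}_{0:1} = E\{\mathbf{Y}_{0:1} \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\} \tag{2.90}$$

$$= \int \mathbf{y}\, f_{\mathbf{Y}_{0:1}|\mathbf{Z}_{0:1}}(\mathbf{y} \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi)\, d\mathbf{y} \tag{2.91}$$

$$= \int \mathbf{y} \sum_{\phi=1}^{2} f_{\mathbf{Y}_{0:1},\Phi_1|\mathbf{Z}_{0:1}}(\mathbf{y}, \Phi_1 = \phi \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi)\, d\mathbf{y} \tag{2.92}$$

$$= \int \mathbf{y} \sum_{\phi=1}^{2} \Pr\{\Phi_1 = \phi \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\}\, f_{\mathbf{Y}_{0:1}|\mathbf{Z}_{0:1},\Phi_1}(\mathbf{y} \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}, \Phi_1 = \phi; \Psi)\, d\mathbf{y} \tag{2.93}$$

$$= \sum_{\phi=1}^{2} \Pr\{\Phi_1 = \phi \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\} \int \mathbf{y}\, f_{\mathbf{Y}_{0:1}|\mathbf{Z}_{0:1},\Phi_1}(\mathbf{y} \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}, \Phi_1 = \phi; \Psi)\, d\mathbf{y} \tag{2.94}$$

$$= \sum_{\phi=1}^{2} \Pr\{\Phi_1 = \phi \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\}\, E\{\mathbf{Y}_{0:1} \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}, \Phi_1 = \phi; \Psi\} \tag{2.95}$$

$$= \sum_{\phi=1}^{2} \Pr\{\Phi_1 = \phi \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\}\, \mathbf{C}_Y(\phi) \left( \mathbf{C}_Y(\phi) + \sigma_V^2 \mathbf{I} \right)^{-1} \mathbf{z}_{0:1} \tag{2.96}$$

$$= \sum_{\phi=1}^{2} \Pr\{\Phi_1 = \phi \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\}\, \mathbf{C}_Y(\phi)\, \mathbf{C}_Z^{-1}(\phi)\, \mathbf{z}_{0:1}, \tag{2.97}$$

where $\mathbf{I}$ is the identity matrix, $\sigma_V^2 \mathbf{I}$ is the covariance matrix of the Gaussian noise vector
$\mathbf{V}_{0:1}$, and $\mathbf{C}_Y(\phi)$ and $\mathbf{C}_Z(\phi)$ are, respectively, the conditional covariance matrices of the
Gaussian signal vector $\mathbf{Y}_{0:1}$ and the Gaussian observation vector $\mathbf{Z}_{0:1}$ given that $\Phi_1 = \phi$.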

One  of  the  most  immediately  obvious  properties  of  the  estimate  given  in  (2.97)  is  that 
it  is  actually  a  weighted  average  of  two  elemental  linear  estimates,  each  accounting  for  a 
unique choice of the true Gaussian pdf of the driving-noise sample at time $t = 1$. The overall
estimate does not inherit the property of linearity, however, because the weighting coefficient
for each term in the average is a nonlinear function of the observed data $\mathbf{z}_{0:1}$. Nonetheless,
this weighting coefficient $\Pr\{\Phi_1 = \phi \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\}$ can still be easily evaluated using
Bayes' rule, since the parameter vector $\Psi$ is precisely known. For example, in the case
where $\Phi_1 = 1$, we can write

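$$\Pr\{\Phi_1 = 1 \mid \mathbf{Z}_{0:1} = \mathbf{z}_{0:1}; \Psi\} = \frac{\Pr\{\Phi_1 = 1; \Psi\}\, f_{\mathbf{Z}_{0:1}|\Phi_1}(\mathbf{z}_{0:1} \mid \Phi_1 = 1; \Psi)}{\sum_{\phi=1}^{2} \Pr\{\Phi_1 = \phi; \Psi\}\, f_{\mathbf{Z}_{0:1}|\Phi_1}(\mathbf{z}_{0:1} \mid \Phi_1 = \phi; \Psi)} \tag{2.98}$$

$$= \frac{\dfrac{\rho}{|\mathbf{C}_Z(1)|^{1/2}} \exp\left\{ -\tfrac{1}{2} \mathbf{z}_{0:1}^T \mathbf{C}_Z^{-1}(1)\, \mathbf{z}_{0:1} \right\}}{\dfrac{\rho}{|\mathbf{C}_Z(1)|^{1/2}} \exp\left\{ -\tfrac{1}{2} \mathbf{z}_{0:1}^T \mathbf{C}_Z^{-1}(1)\, \mathbf{z}_{0:1} \right\} + \dfrac{1 - \rho}{|\mathbf{C}_Z(2)|^{1/2}} \exp\left\{ -\tfrac{1}{2} \mathbf{z}_{0:1}^T \mathbf{C}_Z^{-1}(2)\, \mathbf{z}_{0:1} \right\}}, \tag{2.99}$$

where we have used the explicit form for a bivariate zero-mean Gaussian density.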
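
The two-hypothesis estimator can be sketched directly from (2.97) and (2.99). The function name is hypothetical, and the explicit form of the conditional signal covariance is not given in the text; it is worked out here from (2.83) under the stated Gaussian initialization, so it should be read as an assumption of this sketch.

```python
import numpy as np

def mmse_estimate_n2(z, a, rho, sigma1, sigma2, sigma_y, sigma_v):
    """Weighted average of two linear estimates for an observation of length 2."""
    z = np.asarray(z, dtype=float)
    estimates, scores = [], []
    for prior, s in zip((rho, 1.0 - rho), (sigma1, sigma2)):
        # Covariance of (Y_0, Y_1) given that W_1 has standard deviation s.
        C_y = np.array([[sigma_y**2, a * sigma_y**2],
                        [a * sigma_y**2, a**2 * sigma_y**2 + s**2]])
        C_z = C_y + sigma_v**2 * np.eye(2)
        sol = np.linalg.solve(C_z, z)
        estimates.append(C_y @ sol)                 # elemental linear estimate
        # Prior times Gaussian likelihood; the common 2*pi factor cancels.
        scores.append(prior * np.exp(-0.5 * z @ sol) / np.sqrt(np.linalg.det(C_z)))
    weights = np.array(scores) / np.sum(scores)     # posterior probabilities
    return weights @ np.array(estimates), weights
```

When the second observed sample is far from `a` times the first, the posterior weight shifts toward the component with the larger standard deviation; this data dependence of the weights is what makes the overall estimate nonlinear.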

By  comparing  (2.89)  and  (2.97),  we  see  that  the  optimal  estimate  derived  for  the  case 
$N = 2$ requires considerably more computation than does the estimate for the case $N = 1$.
Of course, part of this increase in computational cost (for example, the evaluation of a
matrix-vector product rather than a product of two scalars) is a direct consequence of the
increase  in  the  length  of  the  observation;  this  portion  of  the  additional  cost  would  be  incurred 
even  in  a  purely  Gaussian  estimation  problem.  The  remaining  amount  of  computation  is  our 
primary  concern,  for  this  is  the  amount  we  incur  solely  because  of  the  non-Gaussian  nature 
of the problem. Clearly, the same kinds of matrix-vector operations that would be required to
form  an  estimate  in  the  purely  Gaussian  case  must  now  be  performed  two  times  (i.e.,  for  two 
distinct  contingencies)  to  obtain  the  final  estimate  in  (2.97);  moreover,  for  each  of  the  two 
elemental  estimates  computed,  an  associated  weight  factor  must  also  be  computed.  Hence, 
when $N = 2$, the amount of computation that is needed to generate an optimal estimate in
the  non-Gaussian  problem  is  at  least  twice  the  amount  required  in  the  analogous  Gaussian 
problem. 

As we might expect, the computational expense doubles yet again when $N = 3$, because
in this case we must account for each of the four possible events $\mathbf{\Phi}_{1:2} = (1,1)$, $\mathbf{\Phi}_{1:2} =
(1,2)$, $\mathbf{\Phi}_{1:2} = (2,1)$, and $\mathbf{\Phi}_{1:2} = (2,2)$ in order to reduce the problem to a collection of
familiar Gaussian sub-problems. The exponential growth in the number of possible
pdf-selection sequences (and hence the amount of computation) continues as the observation
length $N$ increases; this phenomenon is depicted in Figure 2-8. Indeed, when $N$ is allowed
to be arbitrarily large, we must account for every possible event of the form $\mathbf{\Phi}_{1:N-1} =
(\phi_1, \phi_2, \ldots, \phi_{N-1})$, of which there are $2^{N-1}$ in all. For this general case, the expression for
the estimate $\hat{\mathbf{y}}_{0:N-1}$ is given by

$$\hat{\mathbf{y}}_{0:N-1} = \sum_{\phi_1=1}^{2} \sum_{\phi_2=1}^{2} \cdots \sum_{\phi_{N-1}=1}^{2} \Pr\{\mathbf{\Phi}_{1:N-1} = \boldsymbol{\phi} \mid \mathbf{Z}_{0:N-1} = \mathbf{z}_{0:N-1}; \Psi\}\, \mathbf{C}_Y(\boldsymbol{\phi}) \left( \mathbf{C}_Y(\boldsymbol{\phi}) + \sigma_V^2 \mathbf{I} \right)^{-1} \mathbf{z}_{0:N-1}, \tag{2.100}$$

which is seen to be a direct extension of the expression given in (2.97). The above estimate is
clearly a weighted average of $2^{N-1}$ elemental estimates, each tailored to a unique realization
of the state sequence $\mathbf{\Phi}_{1:N-1}$.
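
A brute-force evaluation of this sum shows where the cost comes from. The sketch below is illustrative only: the function name is hypothetical, and the conditional covariance is built from the first-order recursion by writing the signal vector as a lower-triangular matrix applied to the vector of the initial sample and the driving-noise samples, which is a derivation made here rather than one given in the text.

```python
import itertools
import numpy as np

def mmse_estimate_bruteforce(z, a, rho, sigma1, sigma2, sigma_y, sigma_v):
    """Enumerate all 2**(N-1) pdf-selection sequences and average the estimates."""
    z = np.asarray(z, dtype=float)
    N = len(z)
    idx = np.arange(N)
    # Y = L @ (Y_0, W_1, ..., W_{N-1}) with L[i, j] = a**(i - j) for i >= j.
    L = np.tril(float(a) ** (idx[:, None] - idx[None, :]))
    sig = (sigma1, sigma2)
    pri = (rho, 1.0 - rho)
    log_w, ests = [], []
    for phi in itertools.product((0, 1), repeat=N - 1):
        d = np.array([sigma_y**2] + [sig[p] ** 2 for p in phi])
        C_y = (L * d) @ L.T                        # conditional signal covariance
        C_z = C_y + sigma_v**2 * np.eye(N)         # conditional observation covariance
        sol = np.linalg.solve(C_z, z)
        _, logdet = np.linalg.slogdet(C_z)
        log_prior = sum(np.log(pri[p]) for p in phi)
        log_w.append(log_prior - 0.5 * logdet - 0.5 * z @ sol)
        ests.append(C_y @ sol)                     # elemental linear estimate
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())                # normalize in the log domain
    w /= w.sum()
    return w @ np.array(ests), len(ests)
```

The loop body is an ordinary Gaussian estimate; the exponential cost comes entirely from the number of passes through the loop, which doubles with each additional sample.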

It should now be evident that the amount of computation involved in evaluating the
optimal estimate in (2.100) will be prohibitive even when $N$ is only moderately large.
Specifically, evaluating $\hat{\mathbf{y}}_{0:N-1}$ would take more than $2^{N-1}$ times as many arithmetic operations
as would evaluating an estimate of the same length in the case where the signal and noise
vectors are independent and purely Gaussian. We further note that, although there exist
alternative methods for computing the exact value of the optimal estimate in (2.100), it
appears that each of these methods incurs a computational cost that grows exponentially
as a function of the observation length $N$.

For example, one such alternative technique would be to optimally combine the output
signals produced by a bank of Kalman smoothers operating in parallel on the observation
$\mathbf{z}_{0:N-1}$. But it is well known that the classical Kalman smoother is able to produce a globally
optimal estimate of a signal in additive noise only if the signal and noise are jointly Gaussian
and all distributional parameters are known. With this approach, therefore, each Kalman
smoother would have to be configured to operate under the assumption that a particular
value of the underlying state sequence $\mathbf{\Phi}_{1:N-1}$ is in fact the true value; this is the only
way that the Gaussian assumption would hold. Thus, even though each of the recursively
implemented Kalman smoothers might be considered to be computationally efficient when
operating in isolation, a total of $2^{N-1}$ such smoothers (one for each possible state sequence)
would still be required to generate the overall optimal estimate.

Figure 2-8: Binary tree diagram depicting exponential growth in the number of possible
pdf-selection sequences that could be realized from time 0 up to time t. Vertical dashed
lines correspond to fixed time indices. Each potential value of the underlying state sequence
is shown in parentheses next to its associated node on the tree.

A number of suboptimal techniques have been proposed in the literature to overcome the
computational complexity of the ARGMIX signal estimation problem. One of the earliest of
these techniques was put forth by Ackerson and Fu [1], who suggested that the posterior pdf
of the signal variable be modeled as purely Gaussian at each time. Under this assumption,
the signal estimate at a given time index would be the mean of the posterior density; at
the following time index, the pdf of the predicted signal value would consist of $M$ Gaussian
components (owing to the $M$-fold branching process depicted in Figure 2-8), but these
would be subsequently reformed into a single Gaussian pdf through a moment-matching
procedure. Several methods similar to that of Ackerson and Fu were proposed a short time
later [31, 165, 227].
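
The moment-matching step referred to above can be sketched as follows. This shows only the standard collapse of a scalar Gaussian mixture to a single Gaussian with the same mean and variance; it is not a reconstruction of Ackerson and Fu's full filter, and the function name is hypothetical.

```python
import numpy as np

def collapse_mixture(weights, means, variances):
    """Return the mean and variance of a scalar Gaussian mixture."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = np.sum(weights * means)
    # Second moment of the mixture minus the squared overall mean.
    var = np.sum(weights * (variances + means**2)) - mean**2
    return mean, var
```

For two equally weighted unit-variance components centered at -1 and +1, the matched Gaussian has mean 0 and variance 2: the spread between the component means is absorbed into the variance.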

A  very  different  approach  based  on  the  notion  of  random  sampling  was  taken  by  Akashi 
and  Kumamoto  [8].  They  viewed  the  collection  of  all  possible  underlying  driving-noise 
state  sequences  as  a  population,  and  they  generated  a  suboptimal  signal  estimate  using  a 
relatively  small  number  of  these  state  sequences  chosen  at  random  from  the  population. 
This technique had the theoretical advantage that it could produce an estimate arbitrarily
close to the optimal estimate if a sufficiently large number of sequences were selected.
Other proposed approaches to ARGMIX signal estimation have been based on the concept
of pruning the tree in Figure 2-8 to allow only the most likely branches as candidate
hypotheses [179, 204]. Many such techniques bear strong similarities to methods used for
tracking moving targets in a dense multi-target environment [18]. In the remainder of the
thesis,  we  shall  pursue  an  entirely  different  approach  to  the  non-Gaussian  signal  estimation 
problem;  our  method  is  based  on  approximating  the  underlying  signal  with  a  finite-state 
Markov  dynamical  model.  We  begin  exploring  this  approach  in  detail  in  Chapter  3. 

2.6  Discussion 

2.6.1  Remarks  on  the  EMAX  Algorithm 

The  computations  that  constitute  the  EMAX  algorithm  have  an  intuitively  pleasing  form, 
are easy to implement in computer code, and consume little computer memory. The empirical
results presented in Section 2.3 suggest that the EMAX algorithm has at least three
distinct advantages over other techniques proposed for similar estimation problems: (i) it
produces high-quality estimates, since it uses the likelihood function as a guide for finding
solutions, (ii) it converges reliably to a stationary point of the likelihood function, by
virtue  of  being  a  generalized  EM  algorithm,  and  (iii)  it  is  extremely  versatile  because  the 
Gaussian-mixture  pdf  is  able  to  model  a  wide  range  of  densities  very  well. 

Although  the  EMAX  algorithm  is  a  powerful  method  for  estimating  non-Gaussian  signal 
parameters,  a  number  of  issues  must  still  be  addressed  before  it  can  be  transformed  into 
a  robust  signal  analysis  tool.  For  example,  we  have  given  only  cursory  consideration  to 
the  initialization  of  the  EMAX  algorithm  in  the  analysis  presented  here.  For  the  examples 
given in Section 2.3, we adopted an initialization method based on its conceptual and computational
simplicity. This method worked reasonably well for the limited set of examples
addressed  in  that  section.  However,  since  initial  estimates  are  the  key  to  good  performance, 
we  need  an  initialization  procedure  that  will  consistently  lead  to  points  of  high  likelihood 
after  the  algorithm  has  been  iterated  to  convergence.  This  would  be  a  useful  direction  for 
future  work  in  ARGMIX  parameter  identification. 

In  addition,  for  certain  problems  (particularly  those  for  which  the  Gaussian-mixture  pdf 
contains many components), it would be useful to speed up the convergence of the EMAX
algorithm.  This  might  be  accomplished  by  iterating  the  algorithm  until  reaching  the  vicinity 
of  a  local  maximum,  and  then  applying  a  more  efficient  method  (e.g.,  the  Newton-Raphson 
technique)  to  move  to  the  peak.  Also,  it  would  be  useful  to  detect,  during  the  operation  of 
the  algorithm,  whether  a  degenerate  parameter  estimate  is  being  approached,  so  that  the 
algorithm  could  be  restarted  elsewhere  in  the  parameter  space.  Furthermore,  we  note  that  in 
any  practical  setting,  our  observations  of  the  signal  of  interest  will  be  corrupted  by  additive 
noise.  For  example,  the  digital  communications  application  presented  in  Section  2.3.4  is  a 
typical  case  in  which  additive  noise  is  unavoidable.  Hence,  a  modification  of  the  EMAX 
algorithm  should  be  devised  for  estimating  the  parameters  of  an  ARGMIX  process  when 
noise  is  present. 

Another  issue  that  must  be  addressed  is  how  to  estimate  the  parameters  K  (the  order 
of  the  autoregression)  and  M  (the  number  of  constituent  densities  in  the  Gaussian  mixture) 
when  these  parameters  are  not  given  in  advance.  Moreover,  we  need  to  be  able  to  assess 
the  effect  that  incorrectly  chosen  values  for  K  and  M  would  have  on  the  variances  of  the 
remaining  parameter  estimates.  Because  the  Gaussian  mixture  model  is  quite  flexible  even 
when  M  is  very  small  (say,  2  or  3),  the  selection  of  a  suitable  value  for  M  could  probably 
be  easily  accomplished  by  trial  and  error  in  most  cases.  A  number  of  criteria  have  already 
been  proposed  for  selecting  an  appropriate  value  for  K.  The  most  widely  used  among 
these include the information-theoretic criterion [6] and final prediction error metric [4],
both  of  which  were  developed  by  Akaike,  as  well  as  the  minimum  description  length,  which 
was  developed  independently  by  Rissanen  [158]  and  Schwarz  [168].  The  approach  used  in 
each  of  these  criteria  is  essentially  to  augment  the  log-likelihood  function  with  a  penalty 
term  which  increases  monotonically  with  the  parameter  K.  Adding  such  a  penalty  term 
has  the  effect  of  counteracting  the  monotonic  decrease  of  the  prediction  error  variance 
that  typically  results  when  the  model  order  is  increased.  It  is  conceivable  that  a  new  EM 
algorithm  could  be  derived  to  identify  ARGMIX  signal  parameters  (including  the  parameter 
K) upon incorporating one of the model order estimation criteria mentioned above.
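
The penalized-likelihood idea behind these criteria can be sketched as below. This is only an illustration under stated assumptions: it uses the familiar Gaussian prediction-error form (N log of the error variance plus a penalty) as a stand-in for the log-likelihood, it obtains the error variances from the Levinson-Durbin recursion, and the function names are hypothetical. It is not the EM-based extension suggested in the text.

```python
import math

def prediction_error_variances(r, max_order):
    """Levinson-Durbin recursion. Given autocorrelations r[0..max_order],
    return the prediction error variances [E_0, ..., E_max_order]."""
    a = []          # predictor coefficients a_1..a_k for the current order
    err = r[0]
    errs = [err]
    for k in range(1, max_order + 1):
        acc = r[k] - sum(a[j] * r[k - 1 - j] for j in range(k - 1))
        kappa = acc / err if err > 0 else 0.0   # reflection coefficient
        a = [a[j] - kappa * a[k - 2 - j] for j in range(k - 1)] + [kappa]
        err *= (1.0 - kappa * kappa)            # error never increases with order
        errs.append(err)
    return errs

def select_order(r, n_samples, max_order, criterion="aic"):
    """Return (best order, list of scores) for a penalized criterion.
    'aic' uses a penalty of 2 per parameter; anything else uses log(N) (MDL)."""
    errs = prediction_error_variances(r, max_order)
    penalty = 2.0 if criterion == "aic" else math.log(n_samples)
    scores = [n_samples * math.log(e) + penalty * k for k, e in enumerate(errs)]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores
```

With the autocorrelation sequence of a first-order AR process, for example `r = [1.0, 0.5, 0.25, 0.125]`, the error variance stops decreasing after order 1, so the penalty term makes both criteria select K = 1.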

2.6.2  Suggested  Future  Direction  for  ARGMIX  Signal  Estimation 

There is a relatively simple approach to the problem of suboptimal ARGMIX signal estimation
which appears to have been overlooked in the literature, and which we mention here as
a  possible  direction  for  future  research.  The  motivation  for  using  this  approach  is  similar 
to  that  for  using  an  FIR  Wiener  smoother  as  an  alternative  to  the  true  Wiener  smoother 
in  the  purely  Gaussian  case.  The  basic  idea  is  to  generate  an  estimate  of  the  signal  at  each 
time  t  using  only  a  finite-length  portion  of  the  observation  in  the  vicinity  of  time  t.  The 
underlying  assumption  for  this  finite-memory  estimation  scheme  is  that  any  given  portion 
of the observation is accurately characterized by a multivariate Gaussian-mixture pdf having
a fixed number of components. Under this assumption, the processor itself would be
a  nonlinear  combination  of  the  outputs  of  a  fixed  number  of  FIR  Wiener  smoothers  (one 
for  each  component  in  the  mixture),  as  was  the  case  for  the  optimal  smoother  discussed 
in  Section  2.5.  A  potential  difficulty  in  implementing  the  approach,  however,  is  that  the 
parameters  characterizing  the  best  Gaussian-mixture  approximation  of  a  portion  of  the 
observation cannot be expressed easily in terms of the parameters of the ARGMIX measurement
model. Moreover, the approach may perform poorly if the dependence length
induced  by  the  AR  filter  in  the  model  is  large  relative  to  the  length  of  the  portion  of  the 
observation  that  has  been  chosen  for  processing. 

However,  there  exists  an  alternative  method  of  implementing  this  approach  which  may 
overcome  these  difficulties.  In  particular,  we  can  first  pass  the  noisy  measurement  through 
an invertible LTI system, $\mathcal{S}$, then apply a nonlinear estimator to the filtered result, and
finally pass this nonlinearly processed waveform through the inverse of the original LTI system,
$\mathcal{S}^{-1}$. From the principle of reversibility [215], we know that if the nonlinear estimator
applied in the second stage of this system were truly optimal for the result produced by
$\mathcal{S}$, then the overall system would be optimal for the original noisy measurement. A convenient
choice for the LTI system used in the first stage is the inverse of the original AR
filter. Since this inverse filter is linear, we can examine its effect on the signal and noise
separately. In particular, the signal becomes whitened, i.e., transformed back into the original
i.i.d. Gaussian-mixture driving sequence; on the other hand, the observation noise,
which  was  originally  white  and  Gaussian,  becomes  colored  by  the  inverse  filter,  but  remains 
Gaussian.  Hence,  the  roles  of  signal  and  noise  are  now  essentially  reversed. 
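
The effect of the inverse filter on signal and noise can be checked with a short sketch. It assumes a linear AR filter with zero initial conditions; the helper names `ar_filter` and `inverse_ar_filter` are hypothetical and chosen only for this illustration.

```python
def ar_filter(w, a):
    """All-pole AR filter: y_t = sum_k a[k-1] * y_{t-k} + w_t (zero initial state)."""
    y = []
    for t, wt in enumerate(w):
        past = sum(a[k] * y[t - 1 - k] for k in range(len(a)) if t - 1 - k >= 0)
        y.append(past + wt)
    return y

def inverse_ar_filter(x, a):
    """FIR inverse of the AR filter: out_t = x_t - sum_k a[k-1] * x_{t-k}."""
    return [xt - sum(a[k] * x[t - 1 - k] for k in range(len(a)) if t - 1 - k >= 0)
            for t, xt in enumerate(x)]
```

Applying `inverse_ar_filter` to `ar_filter(w, a)` returns the driving sequence `w`. By linearity, applying it to a noisy observation `y + v` returns `w` plus a filtered (colored) version of `v`, which is the role reversal described above.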

Once this initial transformation has been carried out, the non-Gaussian samples in the
measurement are no longer mutually dependent. Since any finite-length portion
of the processed observation is now truly characterized by a multivariate Gaussian-mixture
pdf, there exists a clear link between the parameters of the original ARGMIX model and the
finite-memory processor that should be used for smoothing. Let us denote the transformed
signal and observations by $\{y_t\}$ and $\{z_t\}$, respectively. Then to generate an optimal estimate
of $y_t$ using, say, the subsequence of observations $z_{t-J:t+J}$, we would need to combine
estimates produced by $M^{2J+1}$ distinct finite-length Wiener smoothers, each operating on a
unique assumption about the true value of the subsequence of driving-noise states.
(Recall that $M$ is the number of components in the original Gaussian-mixture driving-noise
pdf.) The appropriate weighting coefficients used to combine these estimates would be the
posterior probabilities of the individual state subsequences, based on the value of $z_{t-J:t+J}$.
Once this estimation procedure has been performed for all samples of the transformed signal,
the resulting sequence of estimates can then be passed through the original AR filter
once again to obtain the final signal estimate.
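
To make the combination rule concrete, the sketch below treats only the simplest case, J = 0, in which a single observation z = y + v is used and only the marginal variance of the Gaussian noise matters. The function names and the scalar setup are assumptions made for illustration; the general case would need one multivariate Wiener smoother per state subsequence.

```python
import math

def gauss_pdf(x, mean, var):
    """Univariate Gaussian density."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def gmix_mmse_estimate(z, weights, means, variances, noise_var):
    """MMSE estimate of y from z = y + v, where y is Gaussian-mixture
    distributed and v ~ N(0, noise_var) is independent of y."""
    # Likelihood of z under each state: N(mean_m, var_m + noise_var).
    likes = [w * gauss_pdf(z, m, s + noise_var)
             for w, m, s in zip(weights, means, variances)]
    total = sum(likes)
    posts = [l / total for l in likes]          # posterior state probabilities
    # One linear (Wiener) estimate per state.
    ests = [m + s / (s + noise_var) * (z - m) for m, s in zip(means, variances)]
    # Nonlinear combination: posterior-weighted sum of the linear estimates.
    return sum(p * e for p, e in zip(posts, ests))
```

With a single component the estimator reduces to the ordinary Wiener gain. For longer windows the number of linear estimates grows as M ** (2 * J + 1), for example 243 when M = 3 and J = 2.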


Chapter  3 

Approximating  Stationary  Signals 
with  Finite-State  Markov  Models 


3.1  Introduction 

The  analysis  in  the  latter  half  of  Chapter  2  identified  some  of  the  practical  difficulties 
involved  in  developing  an  optimal  technique  for  estimating  a  non-Gaussian  signal  in  additive 
noise.  Throughout  that  analysis,  our  attention  was  focused  on  an  estimation  problem  with  a 
particularly  simple  structure;  specifically,  the  signal  was  a  stationary  ARGMIX  process,  the 
noise  was  a  stationary  white  Gaussian  process,  and  the  signal  and  noise  were  assumed  to  be 
independent.  It  was  demonstrated  through  a  simple  example  that  the  optimal  estimation 
scheme  for  the  ARGMIX  problem  was  not  particularly  difficult  to  derive  or  to  understand; 
in  fact,  because  we  were  able  to  decompose  the  original  non-Gaussian  problem  into  more 
familiar,  purely  Gaussian  subproblems,  we  could  express  the  final  solution  in  closed  form 
as  a  (nonlinear)  weighted  average  of  many  different  Wiener  filters.  However,  this  optimal 
estimation  scheme  was,  practically  speaking,  impossible  to  implement  because  it  consumed 
a  prohibitive  amount  of  computation,  even  in  cases  where  the  observed  sequence  contained 
only  a  modest  number  of  samples. 

Our  immediate  temptation  in  this  situation  is  to  search  for  a  tractable,  yet  sufficiently 
sophisticated approximation to the best processor, with the hope that the resulting approximate
scheme will perform satisfactorily in place of the optimal scheme. As we mentioned
at  the  end  of  Chapter  2,  this  basic  approach  has  been  taken  by  many  researchers  for  solving 
non-Gaussian  problems  similar  to  the  ARGMIX  signal  estimation  problem.  In  this  chapter, 
however,  we  shall  take  a  much  different  approach  toward  developing  approximate  signal 
processing  algorithms  —  an  approach  which  reflects  a  fundamental  shift  in  paradigm  for 
the  remainder  of  the  thesis.  In  particular,  we  will  attempt  to  approximate  the  true  random 
process  with  another  random  process  whose  structure  is  much  simpler  and  which  gives  rise 
to  algorithms  that  are  much  more  computationally  efficient. 

The  collection  of  random  processes  that  we  will  consider  as  approximations  is  the  class  of 
finite-state  hidden  Markov  models,  or  HMMs.  An  HMM  consists  of  two  basic  components: 
(i)  a  Markov  chain,  which  characterizes  the  underlying  temporal  structure  of  the  HMM  in 
terms of transitions on a discrete set of states; and (ii) a collection of probability density
functions  (one  for  each  state  of  the  Markov  chain),  which  characterize  the  output  of  the 
HMM.  In  general,  the  state  values  assumed  by  the  underlying  Markov  chain  in  the  HMM 
are  not  directly  observable.  Instead,  at  each  time  index,  the  HMM  output  is  actually  a 
function  of  the  current  state;  this  function  is  random,  rather  than  deterministic,  and  is 
completely  characterized  by  the  pdf  assigned  to  that  state. 
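
The two components of an HMM can be illustrated with a small sampling sketch. It assumes Gaussian output densities purely for convenience (the text allows any pdf per state), indexes states from 0, and uses a hypothetical function name.

```python
import random

def sample_hmm(init_probs, trans, emit_means, emit_stds, n, seed=0):
    """Draw n samples from a finite-state HMM with Gaussian output pdfs.
    Returns (hidden states, observed outputs)."""
    rng = random.Random(seed)
    num_states = len(init_probs)
    states, outputs = [], []
    state = rng.choices(range(num_states), weights=init_probs)[0]
    for _ in range(n):
        states.append(state)
        # The output is a random function of the current state only.
        outputs.append(rng.gauss(emit_means[state], emit_stds[state]))
        # The Markov chain then moves according to the current row of trans.
        state = rng.choices(range(num_states), weights=trans[state])[0]
    return states, outputs
```

An observer would see only `outputs`; the `states` list is returned here just so that the hidden path can be inspected.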

This  class  of  finite-state  random  processes  was  introduced  as  a  statistical  modeling  tool 
in  a  series  of  papers  by  Baum  and  his  colleagues  [19,  20,  21,  22,  23];  since  that  time,  various 
properties  and  algorithms  associated  with  HMMs  have  been  developed  extensively  by  other 
researchers.  The  most  widespread  practical  use  of  these  models  has  been  in  the  area  of 
speech  processing;  specifically,  they  have  been  applied  to  such  problems  as  automatic  speech 
recognition,  speaker  identification,  and  language  identification,  to  name  just  a  few  [78,  84, 
85,  139,  144,  150,  155,  233].  More  recently,  HMMs  have  been  used  to  approximate  certain 
types  of  low-dimensional  dynamical  systems  (e.g.,  systems  that  are  chaotic),  for  the  purpose 
of  either  predicting  the  output  of  such  systems  or  enhancing  the  output  after  it  has  been 
contaminated  with  additive  noise  [97,  130,  157]. 

We  develop  the  HMM-based  approximation  concept  further  in  the  next  two  subsections; 
we  first  outline  the  basic  assumptions  and  notation  that  will  be  used  in  connection  with  the 
signal  approximation  problem,  and  we  then  give  a  concise  formulation  of  the  problem  itself. 
In  the  third  subsection,  we  describe  how  the  remaining  material  in  the  chapter  is  organized. 


3.1.1  Preliminary  Assumptions  and  Notation 

3.1.1.1  Assumptions on the True Source Signal

Our  use  of  HMMs  as  approximating  processes  will  allow  us  to  solve  problems  involving  a 
very  broad  class  of  AR  signals,  which  includes  the  ARGMIX  subclass  as  a  special  case.  In 
the sequel, we will assume that the source signal being approximated, $\{Y_t\}$, is a stationary
AR process described by the $K$th-order nonlinear difference equation

$$Y_t = h(Y_{t-1}, Y_{t-2}, \ldots, Y_{t-K}, W_t), \tag{3.1}$$

where $h(\cdot)$ is a deterministic function and $\{W_t\}$ is a sequence of i.i.d. random variables,
each distributed according to the pdf $f_W(\cdot)$.

When seeking an approximation to $\{Y_t\}$, we will find it convenient to represent this
signal as the output of a nonlinear dynamical system driven by the white noise process
$\{W_t\}$. In view of the process description given in (3.1), we see that a suitable definition for
the state vector $X_t$ of such a dynamical system is given by

$$X_t = (Y_t, Y_{t-1}, \ldots, Y_{t-K+1}). \tag{3.2}$$

This definition allows us to decompose (3.1) into a dynamical equation and an output
equation, as shown by

$$X_t = H(X_{t-1}, W_t) \tag{3.3}$$

$$Y_t = G(X_t), \tag{3.4}$$

where the state transition function $H(\cdot)$ transforms $X_{t-1}$ and $W_t$ into $X_t$ (via the original
regression function $h(\cdot)$) according to

$$H(X_{t-1}, W_t) = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix} X_{t-1} + \begin{bmatrix} h(X_{t-1}, W_t) \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{3.5}$$

and the output function $G(\cdot)$ merely extracts the first element of its vector argument. This
representation of the source signal is useful because it allows us to think of the underlying
dynamics of the signal in a linear-algebraic sense, i.e., in terms of a transformation of
the state from one time to the next within a $K$-dimensional vector space, as depicted in
Figure 3-1(a).

Figure 3-1: Depictions of possible transitions of the state vector from time $t-1$ to time $t$
under (a) the true dynamical structure; and (b) the quantized dynamical structure within
the partitioned state space.
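
The decomposition into a state transition and an output function can be sketched in a few lines. The helper names `make_state_space` and `simulate` are hypothetical; the state is held as a tuple (Y_t, Y_{t-1}, ..., Y_{t-K+1}), and `h` is any user-supplied regression function taking K past values followed by the noise sample.

```python
def make_state_space(h):
    """Build the state transition H and output function G from a regression h."""
    def H(x_prev, w):
        # The new first element comes from the regression; the remaining
        # elements are the previous state shifted down by one position.
        return (h(*x_prev, w),) + tuple(x_prev[:-1])

    def G(x):
        # The output is simply the first element of the state vector.
        return x[0]

    return H, G

def simulate(h, x0, noise):
    """Run the state recursion from initial state x0 over a noise sequence."""
    H, G = make_state_space(h)
    x, ys = tuple(x0), []
    for w in noise:
        x = H(x, w)
        ys.append(G(x))
    return ys
```

For a linear second-order example such as `h = lambda y1, y2, w: 0.5 * y1 - 0.25 * y2 + w`, `simulate` reproduces the output of the direct difference equation, while a nonlinear `h` is handled in exactly the same way.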
3.1.1.2  Assumptions on the HMM-Based Signal Approximation


Our approximation to the true dynamics is represented by an $L$-state HMM, which is
required to satisfy certain constraints. In particular, we will insist that the set of HMM states
stand in one-to-one correspondence with a collection of regions in the original $K$-dimensional
state space.$^1$ We denote these regions by $\mathcal{R}_1, \ldots, \mathcal{R}_L$, and we require that they satisfy
the conditions

$$\mathcal{R}_1 \cup \mathcal{R}_2 \cup \cdots \cup \mathcal{R}_L = \mathbb{R}^K \tag{3.6}$$

and

$$\mathcal{R}_i \cap \mathcal{R}_j = \emptyset, \quad i \neq j, \tag{3.7}$$

so that any point in $\mathbb{R}^K$ is contained in exactly one of the $\mathcal{R}_i$. With the regions defined in
this way, we can explicitly specify a mapping between the original continuous-valued state
space and the discrete-valued state set of the HMM (which we take to be, without loss of
generality, the set $\{1, 2, \ldots, L\}$). This mapping, which we denote by $\theta(\cdot)$, is given by

$$\theta(x) = \begin{cases} 1 & \text{if } x \in \mathcal{R}_1; \\ 2 & \text{if } x \in \mathcal{R}_2; \\ \;\vdots & \\ L & \text{if } x \in \mathcal{R}_L. \end{cases} \tag{3.8}$$


We  will  refer  to  the  constraint  imposed  on  the  HMM  by  this  mapping  as  the  state-space 
partitioning  constraint.  A  notional  depiction  of  the  resulting  approximate  dynamics  in  the 
partitioned  state  space  is  shown  in  Figure  3-l(b). 
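
For a first-order signal (K = 1), one simple way to realize such a partition is to cut the real line at a sorted list of boundaries. The sketch below is an illustrative assumption, not the partition derived later in the chapter; the function name `theta` and the half-open interval convention are choices made here.

```python
from bisect import bisect_right

def theta(x, boundaries):
    """Map a scalar state x to a region index in {1, ..., L}, where
    L = len(boundaries) + 1 and boundaries is sorted in increasing order.
    Region 1 is (-inf, b_1), region i is [b_{i-1}, b_i), region L is [b_{L-1}, +inf)."""
    return 1 + bisect_right(boundaries, x)
```

Because the intervals are disjoint and together cover the real line, every value of x receives exactly one index, which is the property required of the regions above.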

We will denote the underlying $L$-state Markov chain in the HMM by $\{\Theta_t\}$, and we will
assume that this chain is homogeneous, i.e., that any conditional probability of the form
$\Pr\{\Theta_{t+1} = j \mid \Theta_t = i\}$ depends only on the values of $i$ and $j$ and is entirely independent
of the value of the time variable $t$. The stochastic structure of a homogeneous Markov
chain is completely characterized by two distinct sets of parameters: (i) a collection of
state transition probabilities, which we denote by $\{Q(i,j)\}_{i,j=1}^{L}$; and (ii) a collection of
initial state probabilities, which we denote by $\{P(i)\}_{i=1}^{L}$. These parameters are defined,
respectively, by

$$Q(i,j) = \Pr\{\Theta_{t+1} = j \mid \Theta_t = i\}, \quad i, j = 1, 2, \ldots, L \tag{3.9}$$

and

$$P(j) = \Pr\{\Theta_0 = j\}, \quad j = 1, 2, \ldots, L. \tag{3.10}$$

$^1$The term region, as it is used here and in the sequel, should be understood in an intuitive sense as a set
that consists of a single piece. To be more technically precise, we may consider a region to be a nonempty
connected set [2, 163], i.e., a set having the property that any two of its points can be joined by a polygonal
line which also lies in the set. This technical definition will be used only on rare occasions in the remainder
of the thesis, however.

Because  we  will  be  exclusively  considering  the  approximation  of  stationary  processes, 
we  impose  the  additional  constraint  that  the  HMM-based  representation  must  itself  be 
stationary.  Under  this  constraint,  the  pmf  of  the  initial  state  variable  is  identical  to  the 
marginal  pmf  for  every  other  state  variable  in  the  chain.  As  a  consequence,  the  initial 
state  probabilities  and  state  transition  probabilities  of  the  chain  are  related  through  the 
equation  [27] 


$$P(j) = \sum_{i=1}^{L} P(i)\, Q(i,j), \quad j = 1, 2, \ldots, L. \tag{3.11}$$

In addition, the joint pmf characterizing a pair of successive random variables $(\Theta_t, \Theta_{t+1})$ is
constant for all time. We will have occasion to refer to this joint pmf later in the chapter;
its elements will be denoted by $\{R(i,j)\}_{i,j=1}^{L}$.
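
The stationarity relation and the joint pmf of successive states can be computed numerically as sketched below. The sketch assumes an ergodic chain, so that repeated application of the transition matrix converges, and it uses the standard identity that the joint probability of successive states i and j equals P(i) times Q(i, j); the function names are hypothetical.

```python
def stationary_distribution(Q, iters=1000):
    """Approximate the stationary pmf P satisfying P(j) = sum_i P(i) Q(i, j)
    by power iteration from a uniform start (assumes an ergodic chain)."""
    n = len(Q)
    p = [1.0 / n] * n
    for _ in range(iters):
        p = [sum(p[i] * Q[i][j] for i in range(n)) for j in range(n)]
    return p

def joint_pmf(P, Q):
    """Joint pmf of two successive states: R(i, j) = P(i) * Q(i, j)."""
    n = len(Q)
    return [[P[i] * Q[i][j] for j in range(n)] for i in range(n)]
```

For the two-state chain `Q = [[0.9, 0.1], [0.5, 0.5]]` the stationary pmf is (5/6, 1/6). Both the row sums and the column sums of the resulting joint pmf reproduce that same marginal, which is what stationarity requires.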

To  complete  the  definition  of  the  HMM,  we  now  require  only  a  collection  of  L  densities, 
one  for  each  of  the  L  states  of  the  Markov  chain.  We  shall  assume  that  these  densities,  once 
specified,  remain  fixed  for  all  time,  as  do  the  parameters  characterizing  the  Markov  chain. 
However,  we  will  adopt  two  separate  notations  for  these  HMM  output  densities,  depending 
on whether we are describing an approximation for the sequence of state vectors $\{X_t\}$ or
the sequence of signal variables $\{Y_t\}$. In the former case, we will denote the output densities
by $\{f_i(\cdot)\}_{i=1}^{L}$; in the latter case, we use the notation $\{g_i(\cdot)\}_{i=1}^{L}$. It is understood that the
densities $f_j(\cdot)$ and $g_j(\cdot)$ correspond to state $j$ of the underlying Markov chain, so that we
may write

$$f_j(x) = f_{\hat{X}_t \mid \Theta_t}(x \mid \Theta_t = j), \quad j = 1, 2, \ldots, L \tag{3.12}$$

and

$$g_j(y) = f_{\hat{Y}_t \mid \Theta_t}(y \mid \Theta_t = j), \quad j = 1, 2, \ldots, L \tag{3.13}$$

for all values of the time index $t$. We will assume that any output pdf included in an HMM
is continuous throughout its domain (except possibly on a subset having zero measure) and
contains no impulses.

3.1.2  Problem  Statement  and  Approach  to  Solution 

We will assume that all of the quantities that define the true source signal (i.e., the order
of the autoregression, $K$, the regression function itself, $h(\cdot)$, and the driving-noise pdf, $f_W(\cdot)$)
are precisely known. Under these assumptions, together with the stationarity constraint,
we can (at least in principle) derive the exact form of the pdf that characterizes the true
source signal $\{Y_t\}$, or equivalently, the pdf that characterizes the true state-vector sequence
$\{X_t\}$. Hence, we will freely assume that these pdfs are also given. The problem we consider
in this chapter will concern a finite-length portion of the true state-vector sequence given
by

$$X_{0:N-1} = (X_0, X_1, \ldots, X_{N-1}). \tag{3.14}$$

We can think of this subsequence as the collection of underlying state vectors associated
with a measurement of the source signal that we will make at some point in the future.
Our objective is to approximate the random subsequence $X_{0:N-1}$ with another subsequence
$\hat{X}_{0:N-1}$ given by

$$\hat{X}_{0:N-1} = (\hat{X}_0, \hat{X}_1, \ldots, \hat{X}_{N-1}), \tag{3.15}$$

where the elements of $\hat{X}_{0:N-1}$ are the outputs of an $L$-state HMM. Our HMM-based approximation
must satisfy both the stationarity constraint and the state-space partitioning
constraint described earlier.

Our approach to finding a suitable approximation will be to attempt to fit the pdf of the
approximate subsequence $\hat{X}_{0:N-1}$ to the pdf of the true subsequence $X_{0:N-1}$. The attributes
of an optimal fit must be defined in an appropriate statistical sense to be made more precise
later in the chapter. Observe that the approximating pdf is entirely characterized by the
parameters of the HMM, i.e., by the initial state probabilities $\{P(i)\}_{i=1}^{L}$, the state transition
probabilities $\{Q(i,j)\}_{i,j=1}^{L}$, and the state output densities $\{f_i(\cdot)\}_{i=1}^{L}$. Thus, we seek optimal
values for these parameters expressed in terms of the true pdf. Because all of the HMM
parameters depend upon the mapping $\theta(\cdot)$ defined in (3.8), our selection of the best possible
state-space partition will be a critical part of the overall optimization procedure.

In most situations, our actual objective is not to approximate the state-vector subsequence
$X_{0:N-1}$, but rather to approximate the signal subsequence $Y_{0:N-1}$ defined by

$$Y_{0:N-1} = (Y_0, Y_1, \ldots, Y_{N-1}). \tag{3.16}$$

It turns out, however, that the optimal state-vector approximation $\hat{X}_{0:N-1}$ is not only much
easier to derive, but can also be used to generate an optimal approximation

$$\hat{Y}_{0:N-1} = (\hat{Y}_0, \hat{Y}_1, \ldots, \hat{Y}_{N-1}) \tag{3.17}$$

of the true signal subsequence. In particular, the elements of $\hat{Y}_{0:N-1}$ are individually defined
by

$$\hat{Y}_t = G(\hat{X}_t), \quad t = 0, 1, \ldots, N-1, \tag{3.18}$$

where the function $G(\cdot)$ returns the first element of its vector argument. It is straightforward
to show that the signal approximation $\hat{Y}_{0:N-1}$ is once again the output of an HMM; in fact,
this HMM has exactly the same Markov chain parameters as those derived for $\hat{X}_{0:N-1}$.
However, an important change to the model is that a univariate pdf $g_i(\cdot)$ must now be
assigned to state $i$ of the HMM to take the place of the $K$-variate pdf $f_i(\cdot)$ that was
assigned earlier. In light of the relationship in (3.18), we see that this new pdf can readily
be derived from the old pdf by integrating out the unnecessary elements, as shown by

$$g_i(y) = \int f_i(y, y_{t-1:t-K+1})\, dy_{t-1:t-K+1}. \tag{3.19}$$
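
The marginalization in (3.19) can be checked numerically for K = 2. The sketch below uses a correlated bivariate Gaussian as a stand-in for a state output density; that choice, the integration limits, and the function names are all assumptions made for illustration.

```python
import math

def bivariate_normal_pdf(y, y_prev, rho):
    """Zero-mean, unit-variance bivariate Gaussian pdf with correlation rho."""
    det = 1.0 - rho * rho
    q = (y * y - 2.0 * rho * y * y_prev + y_prev * y_prev) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def marginalize(f, y, lo=-8.0, hi=8.0, n=1601):
    """Integrate f(y, y_prev) over y_prev with the trapezoidal rule,
    leaving a univariate density evaluated at y."""
    step = (hi - lo) / (n - 1)
    vals = [f(y, lo + k * step) for k in range(n)]
    return step * (sum(vals) - 0.5 * (vals[0] + vals[-1]))
```

For this stand-in density the result should agree with the standard normal pdf, since integrating out one element of a bivariate Gaussian leaves the Gaussian marginal of the other.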

3.1.3  Chapter  Organization 

The  remainder  of  the  chapter  is  organized  in  the  following  way.  We  begin  by  motivating 
and  defining  a  figure  of  merit  by  which  we  can  assess  the  quality  of  a  given  HMM-based 
approximation;  this  allows  us  to  formulate  the  approximation  procedure  as  a  well  defined 
optimization  problem  over  a  subclass  of  HMMs.  We  then  give  a  detailed  formulation  and 
solution of the approximation problem in the case where the true signal is taken to be a stationary,
first-order random process which is in general non-Gaussian. The results obtained
through this theoretical analysis are then implemented using a numerical, gradient-based
algorithm;  with  this  algorithm,  we  develop  several  different  HMM-based  approximations 
for  a  specific  first-order  AR  linear-Gaussian  random  process,  and  we  demonstrate  that  the 
accuracy  of  the  approximation  improves  as  the  number  of  states  included  in  the  model  is 
increased.  We  then  show  that  our  theoretical  results  can  be  readily  generalized  so  that  they 
apply  equally  well  to  higher-order  stationary  AR  processes.  Finally,  we  provide  a  discussion 
of  the  key  concepts  developed  in  the  chapter,  including  an  analysis  of  the  advantages  and 
limitations  of  certain  assumptions  we  have  placed  on  the  true  and  approximate  processes. 


3.2  Establishing  Criteria  for  a  Good  Approximation 

Thus  far  we  have  stated  that  the  mathematical  structure  of  our  signal  approximation  will 
take  the  form  of  an  HMM,  but  we  have  not  yet  established  criteria  that  would  allow  us  to 
determine  whether  a  particular  approximation  is  the  best  within  a  specified  class.  We  begin 
this  section  by  describing  a  universal  metric  by  which  the  best  possible  signal  approximation 
can  be  identified.  The  optimization  procedure  based  on  this  metric  would,  roughly  speaking, 
seek  to  maximize  the  confusion  of  an  observer  who  is  trying  to  distinguish  between  the 
true  and  approximate  processes  using  complete  statistical  information  about  each  process. 
Because  this  metric  is  difficult  to  analyze,  however,  we  ultimately  settle  on  an  alternative, 
information-theoretic  figure  of  merit  known  as  Kullback-Leibler  distance,  which  provides  a 
measure  of  similarity  between  the  pdf  of  the  true  signal  and  the  pdf  of  the  approximate 
signal. 

3.2.1  Minimax  Probability-of-Error  Approach  to  Approximation 

Let  us  first  consider  the  following  experiment,  which,  for  the  sake  of  simplicity,  involves  the 
approximation  of  only  a  single  random  variable,  rather  than  an  entire  random  process:  Let 
$f_Y(\cdot)$ be the known pdf of the random variable $Y$, and let $\mathcal{T} = \{\hat{f}_Y(\cdot)\}$ be a collection of
feasible approximations of $f_Y(\cdot)$. We wish to find the best approximation contained in the
set $\mathcal{T}$. To determine the quality of a particular approximation, we seek the assistance of
an independent observer and carry out our quality test in the form of a game. Specifically,
suppose the observer is informed that he will be given, in each of a series of experimental
trials, a set of realizations $\{z_t\}_{t=0}^{N-1}$ from the $N$ random variables $\{Z_t\}_{t=0}^{N-1}$, which are
independent and identically distributed according to the pdf $f_Z(\cdot)$. His task during each
experimental  trial  is  to  render  a  decision  indicating  which  of  the  following  two  hypotheses 
is  true: 


$$H_0 : f_Z(\cdot) = f_Y(\cdot)$$
$$H_1 : f_Z(\cdot) = \hat{f}_Y(\cdot)$$


The observer is furnished with complete descriptions of both $f_Y(\cdot)$ and $\hat{f}_Y(\cdot)$, and he is told
that the hypotheses $H_0$ and $H_1$ are equally likely to be true. Our game is structured such
that,  when  the  observer  guesses  correctly,  we  must  pay  him  a  fixed  amount  of  money;  when 
the  observer  guesses  incorrectly,  he  must  pay  us  a  fixed  amount.  The  goal  of  either  player 
in  the  game  is  to  maximize  his  total  winnings  over  the  series  of  experimental  trials. 

Clearly, with his complete knowledge of the experimental setup, the observer can implement
the best possible statistical test for deciding between $H_0$ and $H_1$, namely the
likelihood  ratio  test  (LRT)  or  Neyman-Pearson  test  [15,  94,  132,  147,  215].  Thus,  to  render 
his  decision,  he  uses  the  optimal  rule 


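$$\text{Declare } \begin{Bmatrix} H_0 \text{ true} \\ H_1 \text{ true} \end{Bmatrix} \text{ when } \ell(z_{0:N-1}) \begin{Bmatrix} < \\ > \end{Bmatrix} 0, \tag{3.20}$$

where $\ell(\cdot)$ is the log-likelihood ratio defined by

$$\ell(z_{0:N-1}) = \log \frac{\hat{f}_{Y_{0:N-1}}(z_{0:N-1})}{f_{Y_{0:N-1}}(z_{0:N-1})}. \tag{3.21}$$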
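
The observer's decision rule can be sketched for i.i.d. samples as follows. The sketch assumes that H0 corresponds to the true pdf, that the approximate pdf appears in the numerator of the log-likelihood ratio, and that H0 is declared when the ratio is negative; this is an assumed reading of (3.20) and (3.21). The Gaussian helper and all function names are hypothetical.

```python
import math

def log_gauss(x, mean, var):
    """Log of a univariate Gaussian density (example choice of pdf)."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_likelihood_ratio(z, log_f_true, log_f_approx):
    """Sum over i.i.d. samples of log f_approx(z_t) - log f_true(z_t)."""
    return sum(log_f_approx(zt) - log_f_true(zt) for zt in z)

def decide(z, log_f_true, log_f_approx):
    """Declare 'H0' (samples came from the true pdf) when the ratio is
    negative, and 'H1' (samples came from the approximation) otherwise."""
    return "H0" if log_likelihood_ratio(z, log_f_true, log_f_approx) < 0 else "H1"
```

For instance, with a true pdf N(0, 1) and an approximation N(3, 1), samples clustered near zero lead to the decision `"H0"`, while samples clustered near three lead to `"H1"`.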


Of  course,  we  are  well  aware  that  the  observer  will  use  the  best  available  decision  rule  on 
each  trial;  to  do  otherwise  would  be  to  unnecessarily  increase  the  chance  of  a  monetary  loss. 
It  can  easily  be  shown  that,  when  the  observer  uses  this  optimal  rule,  his  expected  winnings 
will  increase  directly  with  the  probability  that  he  makes  a  correct  decision.  If  we  now  use 
our  knowledge  of  the  two  densities  involved,  as  well  as  our  knowledge  of  the  observer’s 
gaming  strategy,  we  can  quantify  his  winnings  simply  by  calculating  this  probability.  This 
calculation  yields 


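$$\begin{aligned}
\Pr\{\text{Correct decision}\}
&= \Pr\{H_0 \text{ true}, \text{Declare } H_0\} + \Pr\{H_1 \text{ true}, \text{Declare } H_1\} && (3.22) \\
&= \Pr\{H_0 \text{ true}\}\,\Pr\{\text{Declare } H_0 \mid H_0 \text{ true}\} + \Pr\{H_1 \text{ true}\}\,\Pr\{\text{Declare } H_1 \mid H_1 \text{ true}\} && (3.23) \\
&= \tfrac{1}{2}\Pr\{\ell(Z_{0:N-1}) < 0 \mid H_0 \text{ true}\} + \tfrac{1}{2}\Pr\{\ell(Z_{0:N-1}) > 0 \mid H_1 \text{ true}\} && (3.24) \\
&= \tfrac{1}{2}\int_{-\infty}^{0} f_{\ell \mid H_0}(\ell \mid f_Z = f_Y)\, d\ell + \tfrac{1}{2}\int_{0}^{\infty} f_{\ell \mid H_1}(\ell \mid f_Z = \hat{f}_Y)\, d\ell, && (3.25)
\end{aligned}$$

where $f_{\ell \mid H_0}(\cdot)$ and $f_{\ell \mid H_1}(\cdot)$ are the conditional densities of the log-likelihood ratio $\ell(\cdot)$ given
that $H_0$ is true and that $H_1$ is true, respectively. Note in the final step above that we
have indicated explicitly the dependence of the observer's winnings on the approximate pdf
$\hat{f}_Y(\cdot)$.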

The probability expressed in (3.25) will be fixed once we have specified the pdf $f_{\hat{Y}}(\cdot)$.
We are allowed to select any pdf from the set $\mathcal{F}$. Clearly, if we are to maximize our winnings,
we should choose the pdf that makes the above probability as small as possible. That is,
the optimal pdf $f^{*}_{\hat{Y}}(\cdot)$ will be the one that satisfies

$$
f^{*}_{\hat{Y}}(\cdot) = \arg\min_{f_{\hat{Y}} \in \mathcal{F}} \left\{
\tfrac{1}{2} \int_{-\infty}^{0} f_{\ell|H_0}(\ell \mid f_Z = f_{\hat{Y}})\, d\ell
+ \tfrac{1}{2} \int_{0}^{\infty} f_{\ell|H_1}(\ell \mid f_Z = f_{\hat{Y}})\, d\ell \right\}. \tag{3.26}
$$

Although we have expressed the optimal density here in terms of the observer's probability
of correct decision, we could equivalently represent it in terms of his probability of error.
From this alternative perspective, the optimal density in $\mathcal{F}$ is the one that causes the most
confusion for an observer who is trying to discriminate between $f_Y(\cdot)$ and $f_{\hat{Y}}(\cdot)$ using a
statistically optimal test.
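The guessing game can be made concrete with a small simulation. The sketch below is only an illustration under stated assumptions: it takes the scalar case in which the observation is drawn from a unit-variance Gaussian with mean 0 under $H_0$ and mean $\mu$ under $H_1$ (these two densities, the function name `prob_correct_lrt`, and the trial count are all hypothetical choices, not taken from the thesis). For this pair the log-likelihood ratio is $\ell(z) = \mu z - \mu^2/2$, so the probability of a correct decision should be close to $\Phi(\mu/2)$, where $\Phi$ is the standard Gaussian cdf.

```python
import math
import random
from statistics import NormalDist


def prob_correct_lrt(f0, f1, trials=100_000, seed=0):
    """Monte Carlo estimate of Pr{correct decision} for equally likely H0, H1."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        h1_true = rng.random() < 0.5                 # hypotheses are equally likely
        source = f1 if h1_true else f0
        z = rng.gauss(source.mean, source.stdev)     # one observation
        llr = math.log(f1.pdf(z)) - math.log(f0.pdf(z))
        declare_h1 = llr > 0                         # rule (3.20)
        correct += (declare_h1 == h1_true)
    return correct / trials


mu = 1.0                                  # assumed separation between the two means
f0 = NormalDist(0.0, 1.0)                 # density under H0 (illustrative)
f1 = NormalDist(mu, 1.0)                  # density under H1 (illustrative)
estimate = prob_correct_lrt(f0, f1)
exact = NormalDist().cdf(mu / 2.0)        # closed form for this Gaussian pair
print(f"simulated {estimate:.4f}, closed form {exact:.4f}")
```

As $\mu$ shrinks toward zero the two densities become indistinguishable and the probability of a correct decision approaches one half, which is the situation the approximating density is chosen to approach.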


3.2.2  Kullback-Leibler  Distance  as  a  Figure  of  Merit 

While  it  is  very  useful  to  reason  through  an  experiment  such  as  the  one  described  above  to 
find  the  best,  most  general  measure  of  approximation  quality,  unfortunately  the  conclusions 
we  have  drawn  cannot  be  readily  applied  because  the  optimization  problem  in  (3.26)  is 
extremely  difficult  to  solve.  We  must  therefore  search  for  an  alternative  metric  which 
also  indicates  the  degree  of  similarity  between  two  distributions,  but  which  is  much  more 
mathematically  tractable.  Although  several  such  metrics  are  available,  we  now  turn  our 
attention  to  a  popular  information-theoretic  metric  that  is  particularly  well  suited  to  our 
approximation problem, namely Kullback-Leibler distance.²

If $f_Y(\cdot)$ and $f_{\hat{Y}}(\cdot)$ are univariate probability density functions, then the Kullback-Leibler
distance between them, which we denote by $\mathcal{D}(f_Y, f_{\hat{Y}})$, is defined by

$$
\mathcal{D}(f_Y, f_{\hat{Y}}) = \int_{\mathcal{Y}} f_Y(y) \log \frac{f_Y(y)}{f_{\hat{Y}}(y)}\, dy, \tag{3.27}
$$

where $\mathcal{Y}$ represents the region of support for $f_Y(\cdot)$. If, instead, $f_Y(\cdot)$ and $f_{\hat{Y}}(\cdot)$ are univariate
probability mass functions, then the appropriate definition of Kullback-Leibler distance is

$$
\mathcal{D}(f_Y, f_{\hat{Y}}) = \sum_{y \in \mathcal{Y}} f_Y(y) \log \frac{f_Y(y)}{f_{\hat{Y}}(y)}, \tag{3.28}
$$

where $\mathcal{Y}$ now represents the set of possible outcomes under the pmf $f_Y(\cdot)$. When applying
either of the above definitions in the sequel, we will use the conventions

$$
a \log \frac{a}{b} =
\begin{cases}
0 & \text{if } a = 0 \text{ and } b \ge 0; \\
\infty & \text{if } a > 0 \text{ and } b = 0,
\end{cases} \tag{3.29}
$$

which follow from limiting arguments using the continuity of the function $a \log(a/b)$ on
the set $\{(a, b) \mid (a, b) \in (0, \infty) \times (0, \infty)\}$. Note that both (3.27) and (3.28) have obvious
extensions to the case in which the functions $f_Y(\cdot)$ and $f_{\hat{Y}}(\cdot)$ are multivariate.

The Kullback-Leibler distance serves as a convenient measure of our ability to discriminate
between two classes of random observations. Suppose, for example, that $f_Y(y) = f_{\hat{Y}}(y)$
for $y \in \mathbb{R}$ (except possibly on a set of measure zero). In this case, if we were given
observations that were equally likely to be realizations of $Y$ or of $\hat{Y}$, we should not expect to
be able to determine the true source of the observations any better than could a simple
coin flip. This is a very special case in that it represents the smallest achievable degree of
discriminability between two distributions and simultaneously yields the smallest possible
value of $\mathcal{D}(\cdot)$, for we have from (3.27) that $\mathcal{D}(f_Y, f_{\hat{Y}}) = 0$. Now let us consider the opposite
extreme in which $f_Y(\cdot)$ and $f_{\hat{Y}}(\cdot)$ have non-overlapping regions of support, i.e., the case in
which $f_{\hat{Y}}(y) = 0$ whenever $f_Y(y) > 0$, and vice versa. In this case, we could determine
the true source of any observations given to us with absolute certainty, even if we were
given only a single realization from either $Y$ or $\hat{Y}$. Accordingly, in this case we have that
$\mathcal{D}(f_Y, f_{\hat{Y}}) = \infty$. This demonstrates that using Kullback-Leibler distance as a measure of
discriminability is at least reasonable in each of the two extreme cases.

Strictly speaking, the function $\mathcal{D}(\cdot)$ is not a true distance function by the standard
mathematical definition [161]. In particular, although it is true that $\mathcal{D}(f_Y, f_{\hat{Y}}) \ge 0$ with
equality if and only if $f_Y(y) = f_{\hat{Y}}(y)$ almost everywhere, it is not true in general that $\mathcal{D}(\cdot)$
satisfies either the symmetry property

$$
\mathcal{D}(f_Y, f_{\hat{Y}}) = \mathcal{D}(f_{\hat{Y}}, f_Y) \tag{3.30}
$$

or the triangle inequality

$$
\mathcal{D}(f_Y, f_{\hat{Y}}) \le \mathcal{D}(f_Y, f_{\tilde{Y}}) + \mathcal{D}(f_{\tilde{Y}}, f_{\hat{Y}}). \tag{3.31}
$$

²The Kullback-Leibler distance measure derives its name from the authors who originally introduced it
into the information theory literature [105]. This metric is also discussed at length in [106], and has since
been investigated extensively by a number of other researchers [13, 43, 53, 175, 176]. It is referred to by many
different names in the literature, including cross entropy, relative entropy, directed divergence, information
divergence, I-divergence, and discrimination information.

Nonetheless, it is useful to adopt the notion that the "distance" between $f_Y(\cdot)$ and $f_{\hat{Y}}(\cdot)$
increases as $\mathcal{D}(f_Y, f_{\hat{Y}})$ increases, in the sense that the associated random variables $Y$ and $\hat{Y}$
become easier to distinguish. In Appendix D, we discuss in detail how the Kullback-Leibler
distance relates to other, more familiar statistical measures of approximation quality.
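The discrete definition (3.28), together with the conventions in (3.29), translates directly into a few lines of code. The following is a minimal sketch, assuming the two pmfs are supplied as equal-length lists over a common outcome set; the function name `kl_divergence` and the example pmfs are illustrative choices only. It also shows numerically that the measure is not symmetric.

```python
import math


def kl_divergence(p, q):
    """D(p, q) = sum_y p(y) log(p(y) / q(y)) in nats, for pmfs given as lists."""
    total = 0.0
    for a, b in zip(p, q):
        if a == 0.0:
            continue              # convention (3.29): 0 * log(0 / b) = 0
        if b == 0.0:
            return math.inf       # convention (3.29): a * log(a / 0) = infinity
        total += a * math.log(a / b)
    return total


p = [0.5, 0.5, 0.0]               # illustrative pmfs on three outcomes
q = [0.9, 0.1, 0.0]

print(kl_divergence(p, p))                       # identical pmfs
print(kl_divergence(p, q), kl_divergence(q, p))  # the two directions differ
print(kl_divergence([1.0, 0.0], [0.0, 1.0]))     # disjoint supports
```

The three printed cases correspond to the two extremes discussed above (identical distributions and non-overlapping supports) and to the failure of the symmetry property (3.30).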

3.3  Optimal  HMM-Based  Approximation  of  a  First-Order 
AR  Process 

Having established a suitable metric for assessing approximation quality, we now seek to
apply the above concepts in a very simple, illustrative case. Specifically, in this section we
derive an optimal HMM-based approximation to a stationary signal $\{Y_t\}$ which is assumed
to obey the first-order nonlinear difference equation

$$
Y_t = h(Y_{t-1}, W_t), \tag{3.32}
$$

where $h(\cdot)$ is a deterministic function and $\{W_t\}$ is a sequence of i.i.d. random variables
described by the pdf $f_W(\cdot)$. We assume that complete descriptions of the functions $h(\cdot)$ and
$f_W(\cdot)$ are given. In addition, since the pdf of the random process $\{Y_t\}$ can be determined
exactly from this given information, we assume that it, too, is known.


3.3.1  Some  Preliminary  Observations 

Before attempting a detailed problem formulation, let us first discuss certain basic aspects
of the first-order signal approximation problem. Observe from (3.32) that the scalar-valued
signal variable $Y_t$ by itself constitutes a suitable state vector for the dynamical system at
time $t$. Thus, since the state vector is only one-dimensional in this case, we can consider
the state space to be the real line, and we can think of the disjoint "regions" in state space
referred to earlier as disjoint intervals whose union makes up the real line. In this
one-dimensional example, therefore, a segmentation of the state space into $L$ regions can be
conveniently described by a collection of $L + 1$ distinct points $d_0, d_1, \ldots, d_L$ on the real line,
as depicted in Figure 3-2. We refer to these as breakpoints, and we assume that they satisfy
the conditions

$$
-\infty = d_0 < d_1 < \cdots < d_{L-1} < d_L = \infty. \tag{3.33}
$$

We will sometimes use the vector notation $\mathbf{d}$ to refer to the ordered collection of breakpoints
$(d_0, d_1, \ldots, d_L)$.

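The above conditions imply that the mapping $\theta(\cdot)$, which enforces the state-space
partitioning constraint, is now defined by

$$
\theta(y) =
\begin{cases}
1 & \text{if } -\infty < y \le d_1; \\
2 & \text{if } d_1 < y \le d_2; \\
\;\vdots & \\
L & \text{if } d_{L-1} < y < \infty.
\end{cases} \tag{3.34}
$$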
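The mapping in (3.34) amounts to a sorted-array lookup. The sketch below is one possible way to implement it, assuming the cells are half-open on the left and closed on the right, and that the breakpoint vector is passed as a list beginning with negative infinity and ending with positive infinity; the function name `state_index` and the sample breakpoints are hypothetical.

```python
import bisect
import math


def state_index(y, breakpoints):
    """Return theta(y) in {1, ..., L} for breakpoints d_0 = -inf < ... < d_L = +inf."""
    interior = breakpoints[1:-1]                    # d_1, ..., d_{L-1}
    # bisect_left counts the interior breakpoints strictly below y,
    # so a value equal to d_j is assigned to cell j.
    return bisect.bisect_left(interior, y) + 1


d = [-math.inf, -1.0, 0.0, 1.0, math.inf]           # illustrative breakpoints, L = 4
print([state_index(y, d) for y in (-2.0, -1.0, -0.5, 0.3, 5.0)])
```

Because the breakpoints are ordered as in (3.33), the lookup costs $O(\log L)$ per sample rather than a linear scan over the cells.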


We will make extensive use of this mapping as we derive the optimal finite-state
approximation to the first-order AR signal $\{Y_t\}$. Recall from our definition of an HMM-based
representation that the densities $\{f_i(\cdot)\}_{i=1}^{L}$ describing the state vector were restricted in
that their regions of support were not allowed to overlap. In the present case, since the
state variable at time $t$ and the signal variable at time $t$ are identical, these same
region-of-support constraints also apply to the densities $\{g_i(\cdot)\}_{i=1}^{L}$. In particular, the region of
support for the function $g_i(\cdot)$ will be the interval $[d_{i-1}, d_i]$.

Figure 3-2: Partitioning of the state space in the first-order signal approximation problem
via the breakpoints $d_0, d_1, \ldots, d_L$. The resulting segmentation of the state-variable pdf is
also indicated.

3.3.2  Formulation  of  the  Approximation  Problem 

The Kullback-Leibler distance between the densities for the true signal vector $\mathbf{Y}$ and the
HMM-based approximate signal vector $\hat{\mathbf{Y}}$ is given by³

$$
\mathcal{D}(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}}) = \int f_{\mathbf{Y}}(\mathbf{y}) \log \frac{f_{\mathbf{Y}}(\mathbf{y})}{f_{\hat{\mathbf{Y}}}(\mathbf{y})}\, d\mathbf{y}. \tag{3.35}
$$

For our purposes, however, it will be more convenient to work with the alternative version
of this expression given by

$$
\mathcal{D}(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}}) = \int f_{\mathbf{Y}}(\mathbf{y}) \log f_{\mathbf{Y}}(\mathbf{y})\, d\mathbf{y}
- \int f_{\mathbf{Y}}(\mathbf{y}) \log f_{\hat{\mathbf{Y}}}(\mathbf{y})\, d\mathbf{y}. \tag{3.36}
$$

Because the true pdf $f_{\mathbf{Y}}(\cdot)$ is fixed and known, the first term on the right-hand side above is
completely independent of the parameters we will choose for the approximating pdf $f_{\hat{\mathbf{Y}}}(\cdot)$.
Thus, minimizing the original objective function $\mathcal{D}(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}})$ is equivalent to maximizing the
modified objective function $\mathcal{D}'(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}})$ defined by

$$
\mathcal{D}'(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}}) = \int f_{\mathbf{Y}}(\mathbf{y}) \log f_{\hat{\mathbf{Y}}}(\mathbf{y})\, d\mathbf{y}. \tag{3.37}
$$

The maximization of $\mathcal{D}'(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}})$ is to be carried out over a set of approximate densities
having a very special structure. Specifically, each density in the set can be characterized
by a tuple, which we denote by $\Psi$, consisting of all parameters needed to specify an
$L$-state HMM-based representation of the true signal; these parameters include the breakpoint
vector $\mathbf{d}$, the initial state probabilities $\{P(i)\}_{i=1}^{L}$, the state transition probabilities
$\{Q(i,j)\}_{i,j=1}^{L}$, and the output densities $\{g_i(\cdot)\}_{i=1}^{L}$. Let us denote by $\mathcal{P}$ the collection of all
such tuples that satisfy the constraints mentioned in the preceding subsection as well as
those outlined in the introduction to the chapter.⁴ Our problem is then to find the best
such tuple $\Psi^{*}$, which is defined by

$$
\Psi^{*} = \arg\max_{\Psi \in \mathcal{P}} \int f_{\mathbf{Y}}(\mathbf{y}) \log f_{\hat{\mathbf{Y}}}(\mathbf{y}; \Psi)\, d\mathbf{y}. \tag{3.38}
$$

³In this section and in many of the remaining sections in the chapter, we will use the symbol $\mathbf{Y}$ to refer
to the signal vector being approximated, in place of the more cumbersome symbol $Y_{0:N-1}$.


3.3.3  Derivation  of  the  Approximate  Signal  Density 

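We now wish to derive an expression for the pdf of the approximate signal vector
$\hat{\mathbf{Y}} = (\hat{Y}_0, \hat{Y}_1, \ldots, \hat{Y}_{N-1})$ in terms of the parameters that characterize its associated HMM. We
proceed in the usual way by first accounting for all possible values of the underlying state
sequence $\boldsymbol{\Theta} = (\Theta_0, \Theta_1, \ldots, \Theta_{N-1})$ and then conditioning the HMM output on each of these
contingencies. By following this approach, we arrive at the initial pdf expression

$$
f_{\hat{\mathbf{Y}}}(\mathbf{y}) = \sum_{\boldsymbol{\theta}} \Pr\{\boldsymbol{\Theta} = \boldsymbol{\theta}\}\, f_{\hat{\mathbf{Y}}|\boldsymbol{\Theta}}(\mathbf{y} \mid \boldsymbol{\Theta} = \boldsymbol{\theta}). \tag{3.39}
$$

We can introduce a bit more detail into this expression by taking advantage of two special
properties of the HMM structure, namely that (i) the state sequence $\boldsymbol{\Theta}$ obeys the Markov
property; and (ii) the elements of the output sequence $\hat{\mathbf{Y}}$ are statistically independent when
conditioned on a particular value of $\boldsymbol{\Theta}$. Using property (i), we can write

$$
\begin{aligned}
\Pr\{\boldsymbol{\Theta} = \boldsymbol{\theta}\}
&= \Pr\{\Theta_0 = \theta_0\} \cdot \prod_{t=1}^{N-1} \Pr\{\Theta_t = \theta_t \mid \Theta_{t-1} = \theta_{t-1}\} && \text{(3.40)} \\
&= P(\theta_0) \cdot \prod_{t=1}^{N-1} Q(\theta_{t-1}, \theta_t). && \text{(3.41)}
\end{aligned}
$$

From property (ii), we have that

$$
\begin{aligned}
f_{\hat{\mathbf{Y}}|\boldsymbol{\Theta}}(\mathbf{y} \mid \boldsymbol{\Theta} = \boldsymbol{\theta})
&= \prod_{t=0}^{N-1} f_{\hat{Y}_t|\Theta_t}(y_t \mid \Theta_t = \theta_t) && \text{(3.42)} \\
&= \prod_{t=0}^{N-1} g_{\theta_t}(y_t). && \text{(3.43)}
\end{aligned}
$$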


⁴For the moment, we leave the structure of each pdf $g_i(\cdot)$ unconstrained. This will not pose any difficulty
during the analysis presented in this chapter. In later chapters, however, we will find it convenient from a
practical standpoint to restrict this pdf to be a Gaussian mixture with a fixed number of components.

By substituting (3.41) and (3.43) back into (3.39), we obtain the alternative expression for
$f_{\hat{\mathbf{Y}}}(\cdot)$ given by

$$
f_{\hat{\mathbf{Y}}}(\mathbf{y}) = \sum_{\theta_0=1}^{L} \sum_{\theta_1=1}^{L} \cdots \sum_{\theta_{N-1}=1}^{L}
P(\theta_0) \cdot \prod_{t=1}^{N-1} Q(\theta_{t-1}, \theta_t) \prod_{t=0}^{N-1} g_{\theta_t}(y_t). \tag{3.44}
$$

Based on this new expression, it appears, at least upon first inspection, that the pdf of $\hat{\mathbf{Y}}$
is a very complex function; specifically, the pdf is represented above as a sum of $L^N$ terms,
where each term accounts for a possible realization of the underlying state sequence. In
fact, almost all of the terms appearing in the summation in (3.44) are equal to zero: because
the output densities have non-overlapping regions of support, the factor $g_{\theta_t}(y_t)$ vanishes
unless $\theta_t = \theta(y_t)$. The sole exception is the term corresponding to the particular state sequence

$$
\boldsymbol{\theta} = (\theta(y_0), \theta(y_1), \ldots, \theta(y_{N-1})). \tag{3.45}
$$

Using this fact, we can now express the pdf of $\hat{\mathbf{Y}}$ in the much simpler form

$$
f_{\hat{\mathbf{Y}}}(\mathbf{y}) = P(\theta(y_0)) \cdot \prod_{t=1}^{N-1} Q(\theta(y_{t-1}), \theta(y_t)) \cdot \prod_{t=0}^{N-1} g_{\theta(y_t)}(y_t). \tag{3.46}
$$
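The collapse of the $L^N$-term sum in (3.44) to the single term in (3.46) can be checked numerically on a toy model. The sketch below is an illustration only: the two-state parameters `P` and `Q`, the uniform output densities, and the test vector are all made-up values chosen so that the supports of the output densities do not overlap.

```python
import itertools

P = [0.4, 0.6]                          # hypothetical initial state probabilities
Q = [[0.7, 0.3], [0.2, 0.8]]            # hypothetical transition probabilities


def g(state, y):
    """Output densities with disjoint supports: uniform on (-1, 0] and on (0, 2]."""
    if state == 1:
        return 1.0 if -1.0 < y <= 0.0 else 0.0
    return 0.5 if 0.0 < y <= 2.0 else 0.0


def theta(y):
    """State index for the single interior breakpoint d_1 = 0."""
    return 1 if y <= 0.0 else 2


def pdf_brute_force(y):
    """Evaluate (3.44) by summing over all L**N state sequences."""
    L, N = len(P), len(y)
    total = 0.0
    for states in itertools.product(range(1, L + 1), repeat=N):
        term = P[states[0] - 1]
        for t in range(1, N):
            term *= Q[states[t - 1] - 1][states[t] - 1]
        for t in range(N):
            term *= g(states[t], y[t])
        total += term
    return total


def pdf_single_term(y):
    """Evaluate (3.46) using only the state sequence determined by y."""
    states = [theta(v) for v in y]
    value = P[states[0] - 1]
    for t in range(1, len(y)):
        value *= Q[states[t - 1] - 1][states[t] - 1]
    for t, v in enumerate(y):
        value *= g(states[t], v)
    return value


y = [-0.3, 1.2, 0.5]
print(pdf_brute_force(y), pdf_single_term(y))
```

The brute-force evaluation costs $O(N L^N)$ operations, whereas the single-term form costs $O(N)$, which is the practical payoff of the non-overlapping support constraint.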


3.3.4  Decomposition  of  the  Objective  Function 

Now that we have derived an expression for the pdf of the approximate signal vector, let us
once again turn our attention toward the maximization of our objective function $\mathcal{D}'(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}})$.
It is apparent from (3.37) that we first need an expression for the natural logarithm of the
pdf $f_{\hat{\mathbf{Y}}}(\cdot)$. Using (3.46), we easily have that

$$
\log f_{\hat{\mathbf{Y}}}(\mathbf{y}) = \log P(\theta(y_0)) + \sum_{t=1}^{N-1} \log Q(\theta(y_{t-1}), \theta(y_t)) + \sum_{t=0}^{N-1} \log g_{\theta(y_t)}(y_t). \tag{3.47}
$$

Upon substituting this expression into (3.37), we obtain a more explicit form of the objective
function given by

$$
\begin{aligned}
\mathcal{D}'(f_{\mathbf{Y}}, f_{\hat{\mathbf{Y}}})
&= \int f_{\mathbf{Y}}(\mathbf{y}) \log P(\theta(y_0))\, d\mathbf{y}
 + \int f_{\mathbf{Y}}(\mathbf{y}) \left[ \sum_{t=1}^{N-1} \log Q(\theta(y_{t-1}), \theta(y_t)) \right] d\mathbf{y} \\
&\quad + \int f_{\mathbf{Y}}(\mathbf{y}) \left[ \sum_{t=0}^{N-1} \log g_{\theta(y_t)}(y_t) \right] d\mathbf{y} && \text{(3.48)} \\
&= \mathcal{D}_1 + \mathcal{D}_2 + \mathcal{D}_3, && \text{(3.49)}
\end{aligned}
$$

where the terms $\mathcal{D}_1$, $\mathcal{D}_2$, and $\mathcal{D}_3$ are defined in the obvious way. Observe that these
three terms involve different components of the HMM, and can therefore be maximized
separately.⁵ This is precisely the strategy we shall pursue in the next three subsections.


3.3.4.1 Maximization of $\mathcal{D}_1$

We begin by solving for the values of the initial state probabilities $\{P(i)\}_{i=1}^{L}$ that maximize
the term $\mathcal{D}_1$. Note first that $\mathcal{D}_1$ can be written as

$$
\begin{aligned}
\mathcal{D}_1 &= \int f_{\mathbf{Y}}(\mathbf{y}) \log P(\theta(y_0))\, d\mathbf{y} && \text{(3.50)} \\
&= \int_{-\infty}^{\infty} f_{Y_0}(y_0) \log P(\theta(y_0))\, dy_0, && \text{(3.51)}
\end{aligned}
$$

where in the latter step we have eliminated the superfluous variables $y_1, y_2, \ldots, y_{N-1}$ by
integrating them out. To simplify the expression for $\mathcal{D}_1$ even further, it will be convenient
to represent the integral in (3.51) as a sum of integrals taken over disjoint portions of the
real line. In particular, we use the collection of breakpoints $\{d_0, d_1, \ldots, d_L\}$ to segment the real
line according to

$$
\mathbb{R} = [d_0, d_1] \cup [d_1, d_2] \cup \cdots \cup [d_{L-1}, d_L], \tag{3.52}
$$

and then write $\mathcal{D}_1$ as

$$
\mathcal{D}_1 = \sum_{j=1}^{L} \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0) \log P(\theta(y_0))\, dy_0. \tag{3.53}
$$

Next, recall that the function $\theta(\cdot)$ is, by definition, constant over any interval of the form
$[d_{j-1}, d_j]$, so that $\log P(\theta(y_0))$ can be factored out of each of the above integrals. This leads to
the expression

$$
\mathcal{D}_1 = \sum_{j=1}^{L} \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right) \log P(j). \tag{3.54}
$$

Under the assumption that the breakpoints are fixed and that all constraints on the initial
state probabilities are satisfied, the above sum will be largest if we use the assignment

$$
P(j) = \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0, \qquad j = 1, 2, \ldots, L. \tag{3.55}
$$

A proof of this claim can be found in Appendix B.
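The claim behind (3.55) is an instance of the Gibbs inequality: for fixed nonnegative weights $p_j$ that sum to one, the sum $\sum_j p_j \log P(j)$ over pmfs $P$ is largest at $P = p$. The following sketch checks this numerically rather than proving it (the proof is in Appendix B of the thesis); the interval masses in `p`, the helper name `weighted_log_sum`, and the random search are all illustrative assumptions.

```python
import math
import random


def weighted_log_sum(p, P):
    """Evaluate sum_j p_j * log P(j), the form taken by D_1 in (3.54)."""
    return sum(pj * math.log(Pj) for pj, Pj in zip(p, P) if pj > 0.0)


p = [0.2, 0.5, 0.3]                 # hypothetical interval masses of f_{Y_0}
best = weighted_log_sum(p, p)       # value attained by the assignment (3.55)

rng = random.Random(1)
trial_values = []
for _ in range(1000):
    raw = [rng.random() + 1e-9 for _ in p]     # strictly positive draws
    total = sum(raw)
    candidate = [r / total for r in raw]       # a random competing pmf
    trial_values.append(weighted_log_sum(p, candidate))

print(best, max(trial_values))
```

No randomly drawn pmf should exceed the value obtained with $P = p$, and the gap between the two is exactly the Kullback-Leibler distance between $p$ and the candidate.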


⁵Actually, this is not entirely true, since the initial state probabilities $\{P(i)\}_{i=1}^{L}$ and the state transition
probabilities $\{Q(i,j)\}_{i,j=1}^{L}$ are coupled through the stationarity constraint given in (3.11). However, as we
shall soon discover, a fruitful strategy in this case is to proceed with maximizing the terms separately and
then to verify that the optimal solutions do indeed satisfy the constraint.

3.3.4.2 Maximization of $\mathcal{D}_2$

We now turn to the problem of maximizing $\mathcal{D}_2$ through an appropriate choice of the state
transition probabilities $\{Q(i,j)\}_{i,j=1}^{L}$. First, observe that we can write $\mathcal{D}_2$ as

$$
\begin{aligned}
\mathcal{D}_2
&= \int f_{\mathbf{Y}}(\mathbf{y}) \left[ \sum_{t=1}^{N-1} \log Q(\theta(y_{t-1}), \theta(y_t)) \right] d\mathbf{y} && \text{(3.56)} \\
&= \sum_{t=1}^{N-1} \int f_{\mathbf{Y}}(\mathbf{y}) \log Q(\theta(y_{t-1}), \theta(y_t))\, d\mathbf{y} && \text{(3.57)} \\
&= \sum_{t=1}^{N-1} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{Y_{t-1},Y_t}(y_{t-1}, y_t) \log Q(\theta(y_{t-1}), \theta(y_t))\, dy_{t-1}\, dy_t, && \text{(3.58)}
\end{aligned}
$$

where in the second step we have interchanged the order of summation and integration,
and in the last step we have once again eliminated the superfluous variables within each
integral, i.e., all variables having indices other than $t - 1$ or $t$. At this point, however, we
can simplify (3.58) even further by using the fact that the process $\{Y_t\}$ is stationary, and
therefore that the condition

$$
f_{Y_{t-1},Y_t}(y_0, y_1) = f_{Y_0,Y_1}(y_0, y_1) \tag{3.59}
$$

is satisfied for all $y_0, y_1 \in \mathbb{R}$ and for $t = 1, 2, \ldots, N - 1$. This implies that (3.58) can be
written as

$$
\begin{aligned}
\mathcal{D}_2
&= \sum_{t=1}^{N-1} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{Y_0,Y_1}(y_0, y_1) \log Q(\theta(y_0), \theta(y_1))\, dy_0\, dy_1 && \text{(3.60)} \\
&= (N - 1) \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{Y_0,Y_1}(y_0, y_1) \log Q(\theta(y_0), \theta(y_1))\, dy_0\, dy_1. && \text{(3.61)}
\end{aligned}
$$

As a final step in reducing this expression to simplest terms, we once again invoke the
two-part strategy in which we first decompose each integral into a sum of $L$ integrals over
segments of the form $[d_{j-1}, d_j]$, and then use the fact that the function $\theta(\cdot)$ is constant over
each such segment so that we can factor the logarithm out of the integral. Applying this strategy yields
the new formulas

$$
\begin{aligned}
\mathcal{D}_2
&= (N - 1) \sum_{i=1}^{L} \sum_{j=1}^{L} \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1) \log Q(\theta(y_0), \theta(y_1))\, dy_1\, dy_0 && \text{(3.62)} \\
&= (N - 1) \sum_{i=1}^{L} \sum_{j=1}^{L} \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right) \log Q(i, j). && \text{(3.63)}
\end{aligned}
$$

The decomposition of the joint bivariate pdf $f_{Y_0,Y_1}(\cdot)$ implied by (3.63) is depicted in
Figure 3-3.

Now observe that, for a fixed value of $i$, the elements $Q(i, 1), Q(i, 2), \ldots, Q(i, L)$ must
form a pmf. Moreover, there are $L$ such pmfs that make up the entire collection of state
transition probabilities $\{Q(i,j)\}_{i,j=1}^{L}$, and each of these $L$ pmfs is entirely independent of
the others; hence, we can solve for each pmf separately. It can be shown (as before, using
the arguments given in Appendix B) that the elements $Q(i, j)$ that maximize $\mathcal{D}_2$, subject
to the usual normalization constraints, are given by

$$
Q(i, j) = \frac{\int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0}{\int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0},
\qquad i, j = 1, 2, \ldots, L. \tag{3.64}
$$

Figure 3-3: Contour plot of joint bivariate pdf of two successive state variables and its
corresponding decomposition into rectangular regions by the breakpoints $d_0, d_1, \ldots, d_L$.


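3.3.4.3 Maximization of $\mathcal{D}_3$

Finally, we consider maximizing the term $\mathcal{D}_3$. As before, we can apply the usual
manipulations to integrals and summations within $\mathcal{D}_3$, taking advantage of the stationarity of the
process $\{Y_t\}$ as well as the special structure of the indexing function $\theta(\cdot)$ and the breakpoint
vector $\mathbf{d}$, to obtain

$$
\begin{aligned}
\mathcal{D}_3
&= \int f_{\mathbf{Y}}(\mathbf{y}) \left[ \sum_{t=0}^{N-1} \log g_{\theta(y_t)}(y_t) \right] d\mathbf{y} && \text{(3.65)} \\
&= \sum_{t=0}^{N-1} \int f_{\mathbf{Y}}(\mathbf{y}) \log g_{\theta(y_t)}(y_t)\, d\mathbf{y} && \text{(3.66)} \\
&= \sum_{t=0}^{N-1} \int_{-\infty}^{\infty} f_{Y_t}(y_t) \log g_{\theta(y_t)}(y_t)\, dy_t && \text{(3.67)} \\
&= N \int_{-\infty}^{\infty} f_{Y_0}(y_0) \log g_{\theta(y_0)}(y_0)\, dy_0 && \text{(3.68)} \\
&= N \sum_{j=1}^{L} \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0) \log g_j(y_0)\, dy_0. && \text{(3.69)}
\end{aligned}
$$

Under the assumption that $\mathbf{d}$ is fixed, we can now maximize each of the $L$ terms in (3.69)
separately, since they are entirely uncoupled. Once again, from arguments presented in
Appendix B, it follows that the optimal output densities $\{g_j(\cdot)\}_{j=1}^{L}$ of the HMM are given by

$$
g_j(y) =
\begin{cases}
\dfrac{f_{Y_0}(y)}{\int_{d_{j-1}}^{d_j} f_{Y_0}(u)\, du} & \text{if } d_{j-1} < y \le d_j; \\[2ex]
0 & \text{otherwise};
\end{cases}
\qquad j = 1, 2, \ldots, L. \tag{3.70}
$$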


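3.3.4.4 Verification of the Stationarity Constraint

Thus far, we have adopted the strategy of maximizing each of the terms $\mathcal{D}_1$, $\mathcal{D}_2$, and $\mathcal{D}_3$
without regard to the constraint that the derived Markov chain $\{\Theta_t\}$ must be stationary.
This requirement has no effect on the output densities associated with the states, but it does
place a simultaneous restriction on the initial state probabilities and the state transition
probabilities. We now verify that the solutions obtained to the unconstrained maximization
problems above actually meet this constraint. Recall that we must have

$$
\sum_{i=1}^{L} P(i)\, Q(i, j) = P(j), \qquad j = 1, 2, \ldots, L. \tag{3.71}
$$

If we now insert into this expression the optimal values we have already obtained for $P(i)$
and $Q(i, j)$, we find that

$$
\begin{aligned}
\sum_{i=1}^{L} P(i)\, Q(i, j)
&= \sum_{i=1}^{L} \left( \int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0 \right)
 \frac{\int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0}{\int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0} && \text{(3.72)} \\
&= \sum_{i=1}^{L} \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 && \text{(3.73)} \\
&= \int_{d_{j-1}}^{d_j} \sum_{i=1}^{L} \int_{d_{i-1}}^{d_i} f_{Y_0,Y_1}(y_0, y_1)\, dy_0\, dy_1 && \text{(3.74)} \\
&= \int_{d_{j-1}}^{d_j} \int_{-\infty}^{\infty} f_{Y_0,Y_1}(y_0, y_1)\, dy_0\, dy_1 && \text{(3.75)} \\
&= \int_{d_{j-1}}^{d_j} f_{Y_1}(y_1)\, dy_1 && \text{(3.76)} \\
&= P(j), && \text{(3.77)}
\end{aligned}
$$

where the final step uses the stationarity of $\{Y_t\}$, which gives $f_{Y_1}(\cdot) = f_{Y_0}(\cdot)$, together
with (3.55). This is what we wished to show.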
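The conditionally optimal parameters and the stationarity check can be reproduced numerically for a specific process. The sketch below assumes, purely for illustration, a first-order Gaussian AR process $Y_t = aY_{t-1} + W_t$ with unit-variance noise (so that $Y_1$ given $Y_0 = y_0$ is Gaussian with mean $a y_0$ and unit variance), a coefficient $a = 0.8$, and a made-up breakpoint vector. The rectangle probabilities are obtained with a simple midpoint rule over $y_0$; the function name `joint_cell_probs` and the clipping of infinite limits are implementation choices, not part of the derivation.

```python
import math
from statistics import NormalDist

a = 0.8                                                      # assumed AR coefficient
marginal = NormalDist(0.0, math.sqrt(1.0 / (1.0 - a * a)))   # stationary pdf of Y_0
noise = NormalDist(0.0, 1.0)                                 # pdf of W_t
d = [-math.inf, -1.0, 1.0, math.inf]                         # hypothetical breakpoints
L = len(d) - 1


def joint_cell_probs(d, n=4000):
    """R[i][j] ~ Pr{Y_0 in cell i, Y_1 in cell j}, by a midpoint rule over y_0."""
    cells = len(d) - 1
    clip = 12.0 * marginal.stdev          # finite stand-in for the infinite limits
    R = [[0.0] * cells for _ in range(cells)]
    for i in range(cells):
        lo, hi = max(d[i], -clip), min(d[i + 1], clip)
        h = (hi - lo) / n
        for k in range(n):
            y0 = lo + (k + 0.5) * h
            w = marginal.pdf(y0) * h
            for j in range(cells):
                # Conditional probability that Y_1 falls in cell j given Y_0 = y0.
                R[i][j] += w * (noise.cdf(d[j + 1] - a * y0) - noise.cdf(d[j] - a * y0))
    return R


R = joint_cell_probs(d)
P = [marginal.cdf(d[j + 1]) - marginal.cdf(d[j]) for j in range(L)]   # as in (3.55)
Q = [[R[i][j] / P[i] for j in range(L)] for i in range(L)]            # as in (3.64)
PQ = [sum(P[i] * Q[i][j] for i in range(L)) for j in range(L)]        # left side of (3.71)

print("P  =", [round(v, 4) for v in P])
print("PQ =", [round(v, 4) for v in PQ])
```

Up to the quadrature error, the two printed vectors should agree, and each row of `Q` should sum to one.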


3.3.5  Identification  of  Optimal  Breakpoints 

When solving each of the maximization problems above, we assumed the values of the
breakpoints $d_0, d_1, \ldots, d_L$ were fixed. Consequently, for any given collection of valid
breakpoint values, we can now obtain optimal solutions for the HMM parameters directly from
the formulas in (3.55), (3.64), and (3.70). Clearly, however, any change in the values of the
breakpoints will bring about a corresponding change in the values of these conditionally
optimal solutions. The only remaining part of the overall maximization problem, and the part
which we presently consider, is the optimal selection of the breakpoints. To find the optimal
breakpoints, let us first reconstruct the original objective function $\mathcal{D}' = \mathcal{D}_1 + \mathcal{D}_2 + \mathcal{D}_3$ using
all of the optimal solutions just obtained for a fixed value of the breakpoint vector $\mathbf{d}$.

Note that if we take the optimal values of the initial state probabilities from (3.55) and
substitute them back into the derived expression for $\mathcal{D}_1$ (given in (3.54)), we find that the
largest possible value of $\mathcal{D}_1$, conditioned on a particular value of $\mathbf{d}$, is given by

$$
\mathcal{D}_1^{*} = \sum_{j=1}^{L} \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right) \log \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right). \tag{3.78}
$$


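Next, if we take the optimal values of the state transition probabilities from (3.64) and
substitute them back into the derived expression for $\mathcal{D}_2$ (given in (3.63)), we conclude that
the maximum conditional value of $\mathcal{D}_2$ is given by

$$
\begin{aligned}
\mathcal{D}_2^{*}
&= (N - 1) \sum_{i=1}^{L} \sum_{j=1}^{L} \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right)
 \log \left( \frac{\int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0}{\int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0} \right) \\
&= (N - 1) \sum_{i=1}^{L} \sum_{j=1}^{L} \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right)
 \log \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right) \\
&\quad - (N - 1) \sum_{i=1}^{L} \sum_{j=1}^{L} \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right)
 \log \left( \int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0 \right) \\
&= (N - 1) \sum_{i=1}^{L} \sum_{j=1}^{L} \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right)
 \log \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right) \\
&\quad - (N - 1) \sum_{i=1}^{L} \left( \int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0 \right) \log \left( \int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0 \right), && \text{(3.79)}
\end{aligned}
$$

where in the last step we have simplified the double integral in the second term by first
interchanging the order of the inner summation and the outer integration, and then by
integrating out the superfluous variable $y_1$.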

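Finally, if we take the optimal values of the HMM output densities from (3.70) and
substitute them back into the derived expression for $\mathcal{D}_3$ (given in (3.69)), we find that the
maximum value of $\mathcal{D}_3$, conditioned on the value of $\mathbf{d}$, is given by

$$
\begin{aligned}
\mathcal{D}_3^{*}
&= N \sum_{j=1}^{L} \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0) \log \left( \frac{f_{Y_0}(y_0)}{\int_{d_{j-1}}^{d_j} f_{Y_0}(u)\, du} \right) dy_0 && \text{(3.80)} \\
&= N \sum_{j=1}^{L} \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0) \log f_{Y_0}(y_0)\, dy_0
 - N \sum_{j=1}^{L} \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0) \log \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(u)\, du \right) dy_0 && \text{(3.81)} \\
&= N \sum_{j=1}^{L} \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0) \log f_{Y_0}(y_0)\, dy_0
 - N \sum_{j=1}^{L} \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right) \log \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right). && \text{(3.82)}
\end{aligned}
$$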

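Let us now reconstruct the maximum conditional value of the overall objective function
by calculating the sum of the three components above. This yields the expression

$$
\begin{aligned}
\mathcal{D}'^{*} &= \mathcal{D}_1^{*} + \mathcal{D}_2^{*} + \mathcal{D}_3^{*} && \text{(3.83)} \\
&= (N - 1) \sum_{i=1}^{L} \sum_{j=1}^{L} \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right)
 \log \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right) \\
&\quad - 2(N - 1) \sum_{j=1}^{L} \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right) \log \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right) \\
&\quad + N \sum_{i=1}^{L} \int_{d_{i-1}}^{d_i} f_{Y_0}(y_0) \log f_{Y_0}(y_0)\, dy_0. && \text{(3.84)}
\end{aligned}
$$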

Observe, however, that the last term on the right-hand side above can be written as

$$
N \sum_{i=1}^{L} \int_{d_{i-1}}^{d_i} f_{Y_0}(y_0) \log f_{Y_0}(y_0)\, dy_0 = N \int_{-\infty}^{\infty} f_{Y_0}(y_0) \log f_{Y_0}(y_0)\, dy_0, \tag{3.85}
$$

and is therefore invariant with respect to the breakpoint vector $\mathbf{d}$. This allows us to drop
the final term from the objective function $\mathcal{D}'^{*}$, and subsequently remove the factor of $N - 1$
that multiplies the two remaining terms. We can then restate our present goal as

$$
\mathbf{d}^{*} = \arg\max_{-\infty = d_0 < d_1 < \cdots < d_L = \infty} \mathcal{D}''(\mathbf{d}), \tag{3.86}
$$

where $\mathcal{D}''$ is the new objective function given by

$$
\begin{aligned}
\mathcal{D}''(\mathbf{d})
&= \sum_{i=1}^{L} \sum_{j=1}^{L} \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right)
 \log \left( \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0 \right) \\
&\quad - \sum_{j=1}^{L} \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right) \log \left( \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0 \right) \\
&\quad - \sum_{i=1}^{L} \left( \int_{d_{i-1}}^{d_i} f_{Y_1}(y_1)\, dy_1 \right) \log \left( \int_{d_{i-1}}^{d_i} f_{Y_1}(y_1)\, dy_1 \right). && \text{(3.87)}
\end{aligned}
$$

Although the above expression for $\mathcal{D}''(\mathbf{d})$ may at first seem cumbersome (particularly in
view of the fact that the last two terms are equal and could therefore be consolidated), we
have written it in such a way that it can now be easily identified as a familiar
information-theoretic measure. In particular, $\mathcal{D}''(\mathbf{d})$ represents the mutual information between any
pair $(\Theta_t, \Theta_{t+1})$ of successive state variables in the underlying Markov chain whose
parameter values are conditionally optimal given $\mathbf{d}$. Hence, maximizing $\mathcal{D}''(\mathbf{d})$ is equivalent to
maximizing $I(\Theta_t, \Theta_{t+1})$, the mutual information between the discrete random variables $\Theta_t$
and $\Theta_{t+1}$.

To see this more directly, let us assume, without loss of generality, that the variables in
question are $\Theta_0$ and $\Theta_1$. Then for a given value of $\mathbf{d}$, the marginal pmfs for these random
variables, which we denote by $P_0$ and $P_1$, respectively, are given by⁶

$$
P_0(j; \mathbf{d}) = \int_{d_{j-1}}^{d_j} f_{Y_0}(y_0)\, dy_0, \qquad j = 1, 2, \ldots, L, \tag{3.88}
$$

$$
P_1(i; \mathbf{d}) = \int_{d_{i-1}}^{d_i} f_{Y_1}(y_1)\, dy_1, \qquad i = 1, 2, \ldots, L, \tag{3.89}
$$

and their joint pmf is given by

$$
R(i, j; \mathbf{d}) = \int_{d_{i-1}}^{d_i} \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(y_0, y_1)\, dy_1\, dy_0, \qquad i, j = 1, 2, \ldots, L. \tag{3.90}
$$

By substituting these expressions back into (3.87), we can write the objective function much
more plainly as

$$
\begin{aligned}
\mathcal{D}''(\mathbf{d})
&= \sum_{i=1}^{L} \sum_{j=1}^{L} R(i, j; \mathbf{d}) \log R(i, j; \mathbf{d})
 - \sum_{j=1}^{L} P_0(j; \mathbf{d}) \log P_0(j; \mathbf{d}) - \sum_{i=1}^{L} P_1(i; \mathbf{d}) \log P_1(i; \mathbf{d}) && \text{(3.91)} \\
&= I(\Theta_0, \Theta_1; \mathbf{d}). && \text{(3.92)}
\end{aligned}
$$

This representation makes it clear that we should choose the value of the breakpoint vector
$\mathbf{d}$ that yields the largest possible mutual information between successive state variables of
our HMM. While there exists no general closed-form solution for such a value, we now have
a very useful rule for finding an optimal set of breakpoints.

⁶Of course, from our earlier derivation, we know that $P_0$ and $P_1$ must be identical (owing to the
stationarity constraint), but we nonetheless use distinct symbols here to emphasize the mutual information
concept.
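The mutual-information rule can be tried out on a small case. The sketch below is an illustration under stated assumptions, not the gradient-based procedure of Appendix E: it assumes a Gaussian first-order AR process with coefficient $a = 0.8$ and unit-variance noise, a two-state model with a single interior breakpoint $d_1$, and a coarse grid of candidate values. The rectangle probabilities $R(i, j; \mathbf{d})$ are approximated by a midpoint rule, and all function names are hypothetical.

```python
import math
from statistics import NormalDist

a = 0.8                                                      # assumed AR coefficient
marginal = NormalDist(0.0, math.sqrt(1.0 / (1.0 - a * a)))   # stationary pdf of Y_0
noise = NormalDist(0.0, 1.0)                                 # pdf of W_t


def joint_cell_probs(d, n=2000):
    """R[i][j] ~ Pr{Y_0 in cell i, Y_1 in cell j}, by a midpoint rule over y_0."""
    cells = len(d) - 1
    clip = 12.0 * marginal.stdev          # finite stand-in for the infinite limits
    R = [[0.0] * cells for _ in range(cells)]
    for i in range(cells):
        lo, hi = max(d[i], -clip), min(d[i + 1], clip)
        h = (hi - lo) / n
        for k in range(n):
            y0 = lo + (k + 0.5) * h
            w = marginal.pdf(y0) * h
            for j in range(cells):
                R[i][j] += w * (noise.cdf(d[j + 1] - a * y0) - noise.cdf(d[j] - a * y0))
    return R


def mutual_information(R):
    """Mutual information (in nats) of the joint pmf R; equivalent to (3.91)."""
    size = len(R)
    P0 = [sum(R[i]) for i in range(size)]                        # pmf of Theta_0
    P1 = [sum(R[i][j] for i in range(size)) for j in range(size)]  # pmf of Theta_1
    total = 0.0
    for i in range(size):
        for j in range(size):
            if R[i][j] > 0.0:
                total += R[i][j] * math.log(R[i][j] / (P0[i] * P1[j]))
    return total


def objective(interior):
    """Evaluate the breakpoint objective for interior breakpoints d_1, ..., d_{L-1}."""
    d = [-math.inf] + list(interior) + [math.inf]
    return mutual_information(joint_cell_probs(d))


candidates = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
scores = {d1: objective([d1]) for d1 in candidates}
best_d1 = max(scores, key=scores.get)
for d1 in candidates:
    print(f"d1 = {d1:+.1f}  I = {scores[d1]:.4f}")
```

A grid search of this kind scales poorly with the number of interior breakpoints, which is one reason an iterative optimization over $\mathbf{d}$ is attractive for larger $L$.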


3.4  Generation  of  Numerical  Approximations 

We  now  demonstrate  that  an  optimal  HMM-based  representation  of  a  particular  AR  signal 
can  be  constructed  using  a  special  gradient-descent  technique  derived  in  Appendix  E.  At 
the  core  of  our  example  lies  the  true  signal  to  be  approximated,  which,  for  the  purpose  of 
analytical  tractability,  we  have  chosen  to  be  a  first-order  AR  Gaussian  process.  We  compare 
realizations  generated  by  several  different  approximations  of  this  process  and  present  tables 
of  the  parameters  characterizing  each  HMM.  In  addition,  at  the  end  of  the  section,  we 
examine  the  probability  of  error  of  an  optimal  detector,  which  is  designed  to  determine 
whether  it  has  been  given  a  realization  of  the  true  process  or  an  approximate  process. 

3.4.1 Statistical Characterization of the True Random Process $\{Y_t\}$

Throughout this section, we assume that the true source signal $\{Y_t\}$ obeys the first-order
linear difference equation given by

$$
Y_t = a Y_{t-1} + W_t, \tag{3.93}
$$

where $a$ is a real number satisfying $-1 < a < 1$ and $\{W_t\}$ is a sequence consisting of i.i.d.
Gaussian random variables whose pdf $f_W(\cdot)$ is given by

$$
f_W(w) = \mathcal{N}(w; 0, 1). \tag{3.94}
$$

Observe that we can equivalently describe the process $\{Y_t\}$ as the output of a linear
time-invariant system whose impulse response $\{h_t\}$ is the discrete-time sequence defined by

$$
h_t =
\begin{cases}
0, & t < 0, \\
a^t, & t \ge 0,
\end{cases} \tag{3.95}
$$




and whose input is the white Gaussian noise sequence {Wt}. The constraint imposed on the
autoregressive parameter a ensures that the system described by the above impulse response
is stable, and hence that the output of the system, {Yt}, is stationary.
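
The equivalence between the difference equation and the impulse-response description can be checked numerically. The sketch below is only an illustration: it assumes a zero initial condition (so the earliest samples are not yet stationary), and the function names are hypothetical.

```python
import random

def ar1_recursive(a, w):
    """Run Y_t = a*Y_{t-1} + W_t starting from a zero initial condition."""
    y, prev = [], 0.0
    for w_t in w:
        prev = a * prev + w_t
        y.append(prev)
    return y

def ar1_convolution(a, w):
    """Filter the noise with h_t = a**t (t >= 0): Y_t = sum_k a**k * W_{t-k}."""
    return [sum((a ** k) * w[t - k] for k in range(t + 1)) for t in range(len(w))]

rng = random.Random(0)
w = [rng.gauss(0.0, 1.0) for _ in range(200)]   # white Gaussian noise, unit variance
y_rec = ar1_recursive(0.8, w)
y_conv = ar1_convolution(0.8, w)

# The two descriptions agree up to floating-point rounding.
max_diff = max(abs(p - q) for p, q in zip(y_rec, y_conv))
```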


From our discussion in earlier sections, we know that, in order to solve for the best
parameter values of any HMM-based representation of {Yt}, we will require the marginal
pdf for the random variable Yt as well as the joint pdf of the pair of random variables
(Yt, Yt+1). Since {Yt} is zero-mean, both of these densities are completely characterized
by their second-order moments. Let us first calculate the variance of the single random
variable Yt. Using the fact that the elements of {Wt} are statistically independent and have
unit variance, we can write


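\begin{align}
\operatorname{Var}\{Y_t\} &= \operatorname{Var}\Big\{\sum_{k=0}^{\infty} a^k W_{t-k}\Big\} \tag{3.96} \\
&= \sum_{k=0}^{\infty} (a^k)^2 \operatorname{Var}\{W_{t-k}\} \tag{3.97} \\
&= \frac{1}{1-a^2}. \tag{3.98}
\end{align}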

Therefore,  the  pdf  for  the  random  variable  Yt  can  be  expressed  as 


$$f_{Y_t}(y) = \sqrt{\frac{1-a^2}{2\pi}}\, \exp\Big\{-\tfrac{1}{2}(1-a^2)\,y^2\Big\}. \tag{3.99}$$


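Next, let us solve for the covariance of the pair of random variables (Yt, Yt+1). Using the
autoregressive equation that relates these two variables, and the fact that Wt+1 is
independent of Yt, we have

\begin{align}
\operatorname{Cov}\{Y_t, Y_{t+1}\} &= E\{Y_t (a Y_t + W_{t+1})\} \tag{3.100} \\
&= a \operatorname{Var}\{Y_t\} \tag{3.101} \\
&= \frac{a}{1-a^2}. \tag{3.102}
\end{align}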


From the results in (3.98) and (3.102), we conclude that the covariance matrix C of the
random vector Yt:t+1 must have the form

$$C = \begin{bmatrix} \dfrac{1}{1-a^2} & \dfrac{a}{1-a^2} \\[2ex] \dfrac{a}{1-a^2} & \dfrac{1}{1-a^2} \end{bmatrix}. \tag{3.103}$$

It is straightforward to show that the determinant of this matrix is given by

$$|C| = \frac{1}{1-a^2} \tag{3.104}$$

and that its inverse is

$$C^{-1} = \begin{bmatrix} 1 & -a \\ -a & 1 \end{bmatrix}. \tag{3.105}$$

Using these expressions, we can express the pdf of Yt:t+1 as

$$f_{Y_{t:t+1}}(\mathbf{y}) = \frac{\sqrt{1-a^2}}{2\pi}\, \exp\Big\{-\tfrac{1}{2}\,\mathbf{y}^T \begin{bmatrix} 1 & -a \\ -a & 1 \end{bmatrix} \mathbf{y}\Big\}. \tag{3.106}$$

[Table 3.1 body: diagram of the 1-state HMM; surviving entry: 1.00]

Table 3.1: Diagrammatic representation and parameter definitions for 1-state HMM-based
approximation of the AR Gaussian process Yt = 0.8Yt-1 + Wt.
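
The second-order quantities in (3.103) through (3.106) are easy to verify numerically. The following sketch, with hypothetical function names, builds C for a given a, inverts it, and checks that the joint pdf in (3.106) factors into the marginal pdf in (3.99) times the unit-variance Gaussian transition density implied by (3.93).

```python
import math

def ar1_second_order(a):
    """Return the covariance matrix C, its determinant, and its inverse."""
    var_y = 1.0 / (1.0 - a * a)           # (3.98)
    cov = a * var_y                       # (3.102)
    C = [[var_y, cov], [cov, var_y]]      # (3.103)
    det_C = C[0][0] * C[1][1] - C[0][1] * C[1][0]
    C_inv = [[C[1][1] / det_C, -C[0][1] / det_C],
             [-C[1][0] / det_C, C[0][0] / det_C]]
    return C, det_C, C_inv

def joint_pdf(y0, y1, a):
    """Bivariate pdf of (Y_t, Y_{t+1}) as written in (3.106)."""
    quad = y0 * y0 - 2.0 * a * y0 * y1 + y1 * y1   # y^T C^{-1} y
    return math.sqrt(1.0 - a * a) / (2.0 * math.pi) * math.exp(-0.5 * quad)

def marginal_pdf(y, a):
    """Marginal pdf of Y_t as written in (3.99)."""
    return math.sqrt((1.0 - a * a) / (2.0 * math.pi)) * math.exp(-0.5 * (1.0 - a * a) * y * y)

def transition_pdf(y1, y0, a):
    """Density of Y_{t+1} given Y_t = y0, i.e. Gaussian with mean a*y0 and unit variance."""
    return math.exp(-0.5 * (y1 - a * y0) ** 2) / math.sqrt(2.0 * math.pi)

C, det_C, C_inv = ar1_second_order(0.8)
```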


3.4.2 Descriptions of Various HMM-Based Approximations of {Yt}

To conduct our finite-state modeling experiment, we arbitrarily selected the parameter value
a = 0.8. Then, using the expressions for the signal densities given in (3.99) and (3.106),
together with the numerical optimization technique derived in Appendix E, we created several
distinct HMM-based approximations for the true signal. The single factor distinguishing
these approximations was the number of states making up the underlying Markov chain
within each HMM.
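
Once a set of breakpoints is fixed, the corresponding HMM parameters follow by integrating the densities in (3.99) and (3.106) over intervals and rectangles. The sketch below is not the gradient-descent procedure of Appendix E; it only evaluates P(i) and Q(i,j) for given breakpoints, using a simple midpoint rule and a truncated integration range, both of which are assumptions made for illustration.

```python
import math

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def hmm_params(a, interior_breakpoints, n_grid=4000, span=8.0):
    """Return (P, Q) for the intervals defined by the interior breakpoints d_1..d_{L-1}."""
    sigma_y = 1.0 / math.sqrt(1.0 - a * a)
    edges = [-math.inf] + list(interior_breakpoints) + [math.inf]
    L = len(edges) - 1
    # P(i): marginal probability mass of interval i under (3.99).
    P = [norm_cdf(edges[i + 1] / sigma_y) - norm_cdf(edges[i] / sigma_y) for i in range(L)]
    # R(i,j): mass of the bivariate pdf (3.106) over interval i x interval j.
    R = [[0.0] * L for _ in range(L)]
    for i in range(L):
        lo = max(edges[i], -span * sigma_y)
        hi = min(edges[i + 1], span * sigma_y)
        h = (hi - lo) / n_grid
        for k in range(n_grid):
            y = lo + (k + 0.5) * h
            f_y = math.exp(-0.5 * (y / sigma_y) ** 2) / (math.sqrt(2.0 * math.pi) * sigma_y)
            for j in range(L):
                R[i][j] += f_y * (norm_cdf(edges[j + 1] - a * y) - norm_cdf(edges[j] - a * y)) * h
    Q = [[R[i][j] / P[i] for j in range(L)] for i in range(L)]
    return P, Q

# Breakpoints of the 3-state model (values as listed in Table 3.3).
P3, Q3 = hmm_params(0.8, [-0.86, 0.86])
```

With the 3-state breakpoints, the computed values agree with the rounded entries of Table 3.3 to within about 0.01.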

Our  collection  of  approximations  consisted  of  a  1-state  HMM,  a  2-state  HMM,  a  3-state 
HMM,  a  5-state  HMM,  and  a  7-state  HMM.  Complete  parametric  descriptions  of  these 
finite-state approximations are given in Tables 3.1, 3.2, 3.3, 3.4, and 3.5, respectively. Note
that each of the first three tables has been augmented with a corresponding diagram of the
components of the HMM so that its dynamics and its output can be more easily visualized.
Such  diagrams  become  rather  unwieldy  for  higher-order  models,  however,  so  they  have  been 
omitted  from  the  remaining  two  tables.  We  remark  in  addition  that  the  probabilities  listed 
in  each  table  have  been  rounded  to  two  decimal  places  for  succinctness.  Thus,  although 
certain probabilities appear to be equal to zero, all probabilities are, in fact, strictly positive.

i    di       P(i)    Q(i,·)
0    -∞
1     0.00    0.50    0.79  0.21
2    +∞       0.50    0.21  0.79

Table 3.2: Diagrammatic representation and parameter definitions for 2-state HMM-based
approximation of the AR Gaussian process Yt = 0.8Yt-1 + Wt.


i    di       P(i)    Q(i,·)
0    -∞
1    -0.86    0.30    0.70  0.28  0.02
2     0.86    0.40    0.21  0.58  0.21
3    +∞       0.30    0.02  0.28  0.70

Table 3.3: Diagrammatic representation and parameter definitions for 3-state HMM-based
approximation of the AR Gaussian process Yt = 0.8Yt-1 + Wt.

i    di       P(i)    Q(i,·)
0    -∞
1    -1.89    0.13    0.58  0.34  0.07  0.01  0.00
2    -0.59    0.23    0.18  0.45  0.30  0.07  0.00
3     0.59    0.28    0.03  0.25  0.44  0.25  0.03
4     1.89    0.23    0.00  0.07  0.30  0.45  0.18
5    +∞       0.13    0.00  0.01  0.07  0.34  0.58

Table 3.4: Parameter definitions for 5-state HMM-based representation of the AR Gaussian
process Yt = 0.8Yt-1 + Wt.


i    di       P(i)    Q(i,·)
0    -∞       -       -     -     -     -     -     -     -
1    -2.51    0.06    0.52  0.34  0.12  0.02  0.00  0.00  0.00
2    -1.40    0.13    0.16  0.38  0.31  0.12  0.03  0.00  0.00
3    -0.45    0.19    0.04  0.21  0.35  0.27  0.11  0.02  0.00
4     0.45    0.21    0.01  0.08  0.24  0.34  0.24  0.08  0.01
5     1.40    0.19    0.00  0.02  0.11  0.27  0.35  0.21  0.04
6     2.51    0.13    0.00  0.00  0.03  0.12  0.31  0.38  0.16
7    +∞       0.06    0.00  0.00  0.00  0.02  0.12  0.34  0.52

Table 3.5: Parameter definitions for 7-state HMM-based representation of the AR Gaussian
process Yt = 0.8Yt-1 + Wt.


Some  additional  notable  aspects  of  our  HMM-based  approximations  are  highlighted  in 
Figures  3-4  through  3-8.  Each  of  these  figures  characterizes  a  particular  finite-state  model 
from  the  set  of  five  models  that  were  constructed;  the  information  presented  in  each  figure 
allows  us  to  better  understand  the  underlying  dynamics  of  the  associated  HMM  and  to 
compare the attributes of the output of the HMM to those of the true Gaussian process.

In the top portion of each figure, we show a contour plot of the critical bivariate Gaussian
pdf from (3.106) as well as a plot of its univariate projection (i.e., the associated
marginal  pdf),  whose  functional  form  is  given  in  (3.99).  Superimposed  on  each  of  these 
plots  is  a  collection  of  lines  corresponding  to  the  optimal  breakpoints  determined  for  the 
model;  these  lines  help  us  to  see  how  both  the  original  one-dimensional  state  space  and  the 
two-dimensional  coordinate  plane  were  segmented  into  appropriate  regions  of  integration. 
In  the  bottom  portion  of  each  figure,  we  show  realizations  from  the  HMM  and  from  the 
true  Gaussian  process,  as  well  as  corresponding  scatter  plots  that  were  constructed  from 
successive  sample  values  occurring  within  each  realization.  These  two  sets  of  plots  allow  us 
to  quickly  assess,  by  means  of  a  direct  visual  comparison,  the  quality  of  a  given  HMM-based 
representation  of  the  original  signal. 

From  the  plots  shown  in  Figure  3-4,  we  can  examine  the  key  characteristics  of  our  1-state 
approximation.  Recall  that  any  1-state  HMM  is  inherently  a  memoryless  process,  i.e.,  the 
random  variables  that  make  up  the  process  exhibit  no  temporal  dependence  whatsoever. 
This  fundamental  property  of  the  1-state  model  is  clearly  evident  in  both  the  realization 
and the scatter plot shown in Figure 3-4(b). Specifically, note from the scatter plot in
this example that, although the marginal distributions of the samples from each realization
appear closely matched, the dispersion pattern of the points about the origin is approximately
the same in all directions, an indication that successive samples of the process are
independent.

In Figure 3-5, we can see the temporal correlation structure of the approximate process
begin to take shape, owing to the dynamics of its underlying 2-state Markov chain.
Nonetheless, the output of the 2-state HMM still appears to be a rather coarse representation
of the true process, since it is capable only of switching back and forth between an
approximate positive random value and an approximate negative random value.

As  we  can  see  from  Figures  3-6,  3-7,  and  3-8,  however,  the  finite-state  approximation 
gets  progressively  better  as  more  states  are  incorporated  into  the  model.  In  particular, 
from  the  plots  shown  in  Figure  3-8,  we  see  that  the  7-state  HMM  is  capable  of  producing  a 
realization  that  is  nearly  indistinguishable  —  at  least  in  its  observable  statistical  attributes 
—  from  a  realization  of  the  true  random  process. 


Figure 3-4: Modeling of the AR Gaussian process Yt = 0.8Yt-1 + Wt using a 1-state HMM:
(a) (left) contour plot of bivariate Gaussian pdf for the pair of random variables (Yt, Yt+1);
(right) plot of its univariate projection, the marginal pdf for the single random variable Yt;
(b) (left) realization y0:999 from 1-state HMM; (right) corresponding scatter plot of the pairs
(yt, yt+1); (c) (left) realization y0:999 from true AR Gaussian process; (right) corresponding
scatter plot of the pairs (yt, yt+1).

Figure 3-5: Modeling of the AR Gaussian process Yt = 0.8Yt-1 + Wt using a 2-state HMM:
(a) (left) contour plot of bivariate Gaussian pdf for the pair of random variables (Yt, Yt+1);
(right) plot of its univariate projection, the marginal pdf for the single random variable Yt;
(b) (left) realization y0:999 from 2-state HMM; (right) corresponding scatter plot of the pairs
(yt, yt+1); (c) (left) realization y0:999 from true AR Gaussian process; (right) corresponding
scatter plot of the pairs (yt, yt+1).

Figure 3-6: Modeling of the AR Gaussian process Yt = 0.8Yt-1 + Wt using a 3-state HMM:
(a) (left) contour plot of bivariate Gaussian pdf for the pair of random variables (Yt, Yt+1);
(right) plot of its univariate projection, the marginal pdf for the single random variable Yt;
(b) (left) realization y0:999 from 3-state HMM; (right) corresponding scatter plot of the pairs
(yt, yt+1); (c) (left) realization y0:999 from true AR Gaussian process; (right) corresponding
scatter plot of the pairs (yt, yt+1).

Figure 3-7: Modeling of the AR Gaussian process Yt = 0.8Yt-1 + Wt using a 5-state HMM:
(a) (left) contour plot of bivariate Gaussian pdf for the pair of random variables (Yt, Yt+1);
(right) plot of its univariate projection, the marginal pdf for the single random variable Yt;
(b) (left) realization y0:999 from 5-state HMM; (right) corresponding scatter plot of the pairs
(yt, yt+1); (c) (left) realization y0:999 from true AR Gaussian process; (right) corresponding
scatter plot of the pairs (yt, yt+1).

3.4.3 Verification that Approximation Improves with Model Order


While  it  is  useful  to  assess  the  relative  quality  of  various  finite-state  models  —  as  we  have 
just  done  with  the  aid  of  Figures  3-4  through  3-8  —  it  is  also  very  important  to  verify  such 
comparisons  by  using  quantitative  methods  that  are  well  understood.  In  this  brief  section, 
we  demonstrate,  using  a  classical  quantitative  test,  that  our  earlier  qualitative  assessment 
was  correct,  i.e.,  that  the  finite-state  approximations  do  indeed  become  progressively  better 
as  more  states  are  added  to  the  underlying  Markov  chain. 

Specifically, let us consider the following experiment: Suppose we are given an observation
of length N, and we know that it is a realization of either the true AR Gaussian process
defined earlier or an HMM-based approximation of this process. Moreover, we know that
these possibilities are equally likely to be true. All parameters of both processes, including
the finite-state model order, L, are assumed known. We wish to determine, with minimum
probability of error, which process gave rise to the observation.

It is well known that the best test to apply in this case is the likelihood ratio test (LRT).
If we denote the realization by $z_{0:N-1}$, and we let $H_0$ and $H_1$ represent the hypotheses that
the realization was generated by the approximate and true processes, respectively, then the
LRT for this situation can be expressed as

$$\text{Declare } \begin{Bmatrix} H_0 \\ H_1 \end{Bmatrix} \text{ when } \ell_0(z_{0:N-1}) \begin{Bmatrix} > \\ < \end{Bmatrix} \ell_1(z_{0:N-1}), \tag{3.107}$$

where $\ell_0(\cdot)$ and $\ell_1(\cdot)$ are the log-likelihood functions associated with the approximate and
true processes, respectively, and are given by

$$\ell_0(z_{0:N-1}) = \log P(\theta(z_0)) + \sum_{t=1}^{N-1} \log Q(\theta(z_{t-1}), \theta(z_t)) - N \log(\sqrt{2\pi}\,\sigma_Y) - \sum_{t=0}^{N-1} \frac{z_t^2}{2\sigma_Y^2} - \sum_{j=1}^{L} N_j(z_{0:N-1}) \log P(j) \tag{3.108}$$

and

$$\ell_1(z_{0:N-1}) = -\log(\sqrt{2\pi}\,\sigma_Y) - \frac{z_0^2}{2\sigma_Y^2} - (N-1)\log(\sqrt{2\pi}\,\sigma_W) - \sum_{t=1}^{N-1} \frac{(z_t - a z_{t-1})^2}{2\sigma_W^2}. \tag{3.109}$$

Here we have used the notation $N_j(z_{0:N-1})$ to represent the number of occurrences in the
realization $z_{0:N-1}$ of a value that would have been generated in state j of the HMM.
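
The test in (3.107) can be sketched directly from (3.108) and (3.109). The fragment below assumes unit driving-noise variance and uses the rounded 3-state parameters of Table 3.3 purely as an example; the function names are hypothetical, and the rounded tables of the larger models contain zero entries that would need their unrounded values before taking logarithms.

```python
import bisect
import math

A = 0.8
BREAKS = [-0.86, 0.86]                       # interior breakpoints (Table 3.3)
P = [0.30, 0.40, 0.30]                       # rounded state probabilities (Table 3.3)
Q = [[0.70, 0.28, 0.02],
     [0.21, 0.58, 0.21],
     [0.02, 0.28, 0.70]]                     # rounded transition probabilities

def state_index(z):
    """Quantizer theta(z): 0-based index of the interval that contains z."""
    return bisect.bisect_left(BREAKS, z)

def loglik_true(z):
    """Log-likelihood of (3.109) with sigma_W = 1 and sigma_Y^2 = 1/(1 - a^2)."""
    var_y = 1.0 / (1.0 - A * A)
    ll = -0.5 * math.log(2.0 * math.pi * var_y) - z[0] ** 2 / (2.0 * var_y)
    for t in range(1, len(z)):
        ll += -0.5 * math.log(2.0 * math.pi) - 0.5 * (z[t] - A * z[t - 1]) ** 2
    return ll

def loglik_hmm(z):
    """Log-likelihood of (3.108): chain term, marginal Gaussian term, and -log P per sample."""
    var_y = 1.0 / (1.0 - A * A)
    states = [state_index(v) for v in z]
    ll = math.log(P[states[0]])
    for t in range(1, len(z)):
        ll += math.log(Q[states[t - 1]][states[t]])
    for v, s in zip(z, states):
        ll += -0.5 * math.log(2.0 * math.pi * var_y) - v * v / (2.0 * var_y) - math.log(P[s])
    return ll

def decide(z):
    """Return 0 if the approximate process is declared, 1 if the true process is declared."""
    return 0 if loglik_hmm(z) > loglik_true(z) else 1
```

For a single sample the two log-likelihoods coincide, since both processes share the same marginal pdf; the test can only discriminate through the dependence between successive samples.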

An analytical expression for the probability of error, which we denote by Perr, is very
difficult to obtain for this problem. Thus, we have resorted to approximating Perr by
applying the above LRT in a series of experimental trials. Specifically, after fixing values
for L and N, we generated a total of 10000 realizations (of which half were from the true
process and the other half from the approximate process) and applied the test in (3.107) to
each realization. The fraction of incorrect decisions made by the optimal test served as the
estimate of Perr.

Figure 3-8: Modeling of the AR Gaussian process Yt = 0.8Yt-1 + Wt using a 7-state HMM:
(a) (left) contour plot of bivariate Gaussian pdf for the pair of random variables (Yt, Yt+1);
(right) plot of its univariate projection, the marginal pdf for the single random variable Yt;
(b) (left) realization y0:999 from 7-state HMM; (right) corresponding scatter plot of the pairs
(yt, yt+1); (c) (left) realization y0:999 from true AR Gaussian process; (right) corresponding
scatter plot of the pairs (yt, yt+1).

Figure 3-9: Results of applying an optimal detector to determine whether an observed
data sequence is a realization of the true AR Gaussian signal Yt = 0.8Yt-1 + Wt or an
approximation to this process by an L-state HMM. Plots of probability of error versus L
are shown for observation lengths of N = 12, N = 25, and N = 50.
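
The trial procedure used to estimate Perr can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the code used to produce Figure 3-9: it uses the rounded 3-state transition matrix of Table 3.3, recomputes the state probabilities from the breakpoints so that the output densities are properly normalized, draws HMM outputs by inverse-CDF sampling of the truncated marginal, and runs fewer trials than the 10000 quoted above. All names are hypothetical.

```python
import bisect
import math
import random
from statistics import NormalDist

A = 0.8
SIGMA_Y = 1.0 / math.sqrt(1.0 - A * A)
BREAKS = [-0.86, 0.86]                              # Table 3.3 breakpoints
Q = [[0.70, 0.28, 0.02], [0.21, 0.58, 0.21], [0.02, 0.28, 0.70]]
MARGINAL = NormalDist(0.0, SIGMA_Y)                 # marginal pdf of Y_t, as in (3.99)
CDF = [0.0] + [MARGINAL.cdf(d) for d in BREAKS] + [1.0]
P = [CDF[i + 1] - CDF[i] for i in range(len(Q))]    # interval masses

def sample_true(n, rng):
    """Stationary realization of Y_t = A*Y_{t-1} + W_t."""
    y = [rng.gauss(0.0, SIGMA_Y)]
    for _ in range(n - 1):
        y.append(A * y[-1] + rng.gauss(0.0, 1.0))
    return y

def sample_hmm(n, rng):
    """Realization of the HMM: Markov state chain with truncated-Gaussian outputs."""
    states = range(len(P))
    s = rng.choices(states, weights=P)[0]
    out = []
    for _ in range(n):
        u = CDF[s] + rng.random() * (CDF[s + 1] - CDF[s])
        out.append(MARGINAL.inv_cdf(min(max(u, 1e-12), 1.0 - 1e-12)))
        s = rng.choices(states, weights=Q[s])[0]
    return out

def loglik_true(z):
    ll = math.log(MARGINAL.pdf(z[0]))
    for t in range(1, len(z)):
        ll += -0.5 * math.log(2.0 * math.pi) - 0.5 * (z[t] - A * z[t - 1]) ** 2
    return ll

def loglik_hmm(z):
    s = [bisect.bisect_left(BREAKS, v) for v in z]
    ll = math.log(P[s[0]])
    for t in range(1, len(z)):
        ll += math.log(Q[s[t - 1]][s[t]])
    for v, k in zip(z, s):
        ll += math.log(MARGINAL.pdf(v)) - math.log(P[k])
    return ll

def estimate_perr(n, trials, seed=0):
    """Fraction of wrong LRT decisions over an equal mix of true and HMM realizations."""
    rng = random.Random(seed)
    errors = 0
    for k in range(trials):
        from_true = (k % 2 == 1)
        z = sample_true(n, rng) if from_true else sample_hmm(n, rng)
        declared_true = loglik_true(z) >= loglik_hmm(z)
        errors += (declared_true != from_true)
    return errors / trials

perr_25 = estimate_perr(25, 2000, seed=1)
```

Repeating the call for several observation lengths and model orders (with the corresponding parameter tables) would give curves of the kind plotted in Figure 3-9.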

In  Figure  3-9,  we  present  a  plot  of  the  probability  of  error  versus  the  number  of  states 
included  in  the  HMM.  A  total  of  three  curves  are  shown  on  this  plot,  corresponding  to  the 
cases  in  which  the  length  of  the  observed  sequence  was  12  samples,  25  samples,  or  50  samples. 
As  we  can  clearly  see  from  these  curves,  the  HMM-based  approximations  do  indeed  become 
better  as  L  increases;  that  is,  a  high-order  approximation  leads  to  greater  confusion  on 
the  part  of  an  optimal  detector  than  does  a  relatively  low-order  approximation.  We  note  in 
addition that no two curves plotted in this figure ever intersect; this demonstrates simply
that,  for  a  fixed  model  order  L,  the  ability  of  an  optimal  processor  to  discriminate  between 
the  true  and  approximate  processes  increases  uniformly  (or  equivalently,  Perr  decreases 
uniformly)  as  the  number  of  observed  samples,  N,  increases. 

3.5  Generalization  of  the  Optimal  Solution  to  Higher-Order 
AR  Signals 

Thus  far  in  the  chapter,  we  have  focused  our  attention  almost  exclusively  on  the  problem  of 
approximating  a  first-order  stationary  AR  process.  For  this  first-order  case,  we  found  that 
the state vector of the associated dynamical system was merely a scalar, and that the state
space was therefore just the real line. In fact, it was precisely because of this uncomplicated
structure  that  we  were  drawn  to  the  first-order  approximation  problem  as  a  starting  point; 
it  made  certain  theoretical  concepts  easy  to  visualize  and  to  understand. 

As  we  shall  soon  discover,  however,  a  number  of  subtleties  and  complexities  crop  up  as  we 
begin  to  consider  problems  of  higher  dimension.  In  this  final  section,  we  discuss  many  of  the 
important  issues  that  arise  when  we  attempt  to  apply  the  concepts  developed  for  the  first- 
order  case  to  the  approximation  of  AR  processes  having  order  K  >  1.  We  remark  in  addition 
that  much  of  the  material  in  this  section  is  presented  purely  for  heuristic  purposes;  our  main 
objective  is  to  identify  and  understand  the  issues  involved  in  higher-order  problems,  rather 
than  to  develop  techniques  for  generating  concrete  numerical  solutions  to  these  problems. 

3.5.1  Specification  of  the  State-Space  Partition 

Recall from our earlier discussion that, when using the finite-state approach to approximate
the dynamics of a one-dimensional system, we first decomposed the real line into a collection
of L disjoint intervals and then created a one-to-one correspondence between this collection
of intervals and the set of L states of the approximating Markov chain. The intervals
themselves could be readily specified via the (L + 1)-dimensional tuple of ordered breakpoints
$(d_0, d_1, \ldots, d_L)$, in which the first and last elements were constrained, respectively, by the
equations $d_0 = -\infty$ and $d_L = +\infty$. Once values were specified for the remaining L - 1
elements in the tuple, corresponding expressions could be written down immediately for the
optimal values of the HMM parameters, i.e., for the initial state probabilities $\{P(i)\}_{i=1}^{L}$, the
state transition probabilities $\{Q(i,j)\}_{i,j=1}^{L}$, and the output densities $\{g_i(\cdot)\}_{i=1}^{L}$. The search
for the best values of the remaining L - 1 breakpoints formed the core of the approximation
problem.

In the higher-order case, however, the state space is $\mathbb{R}^K$, rather than $\mathbb{R}$, and the relevant
subsets of the state space thus become full-fledged regions, rather than mere intervals.
Consequently, a partitioning of this higher-dimensional space into L disjoint regions can no
longer be accomplished simply by specifying the values of L - 1 numbers on the real line as
before. Instead, we must now specify a collection of contours or surfaces that would form
the region boundaries within $\mathbb{R}^K$. We demonstrate this notion in Figure 3-10 for the case in
which K = 2. Clearly, even in this two-dimensional case the definition of region boundaries
is considerably more complex than it was in the one-dimensional case.

Indeed, even if a suitable description of the region boundaries could be found (say,
for example, some low-order polynomial description), then for any given set of boundary
values we would still be left with the problem of evaluating the best corresponding set of
HMM parameters. This in turn could only be accomplished through the involved procedure
of integrating the signal pdf over irregularly shaped regions in a multi-dimensional space.
Moreover, even if this latter step could be achieved using a reasonable amount of computation,
we would then require a method for determining the optimal set of boundaries, i.e.,
the partition of $\mathbb{R}^K$ that ultimately yields the best finite-state approximation of the actual
signal according to the Kullback-Leibler distance metric.

It is straightforward to show, however, that the fundamental optimization principle
guiding the search for the best partition remains exactly the same for the case K > 1 as
it was for the case K = 1; in particular, to find the optimal partition of $\mathbb{R}^K$, we should
adjust the boundaries of its L constituent regions so as to maximize the mutual information
between successive state variables of the underlying Markov chain.

Figure 3-10: Partitioning of a two-dimensional state space into L disjoint regions.


3.5.2  Evaluation  of  the  HMM  Parameters 


Let  us  suppose  for  the  moment  that  a  tractable  description  of  the  region  boundaries  (or 
equivalently,  of  the  regions  themselves)  is  available,  so  that  we  may  concentrate  on  the 
subsequent  step  of  finding  the  best  HMM  parameters  associated  with  a  particular  set  of 
boundaries. Throughout this section, we will assume that a partition of the state space
has already been specified, and we will denote the L regions that make up the space by
$\mathcal{R}_1, \ldots, \mathcal{R}_L$.

By  using  the  same  basic  principles  of  optimality  that  were  applied  in  the  first-order 
problem,  we  readily  conclude  that  the  best  choice  for  the  ith  element  of  the  initial  state 
pmf  for  the  underlying  Markov  chain  is  given  by 

$$P(i) = \int_{\mathbf{y} \in \mathcal{R}_i} f_{Y_{t:t+K-1}}(\mathbf{y})\, d\mathbf{y}, \tag{3.110}$$


i.e., it is equal to the unconditional probability that the original state vector Xt will lie in
the region $\mathcal{R}_i$. Furthermore, we conclude that the best choice for the output pdf associated
with the ith state of the Markov chain is given by

$$f_i(\mathbf{x}) = \begin{cases} \dfrac{1}{P(i)}\, f_{Y_{t:t-K+1}}(\mathbf{x}), & \text{if } \mathbf{x} \in \mathcal{R}_i, \\ 0, & \text{otherwise,} \end{cases} \tag{3.111}$$

i.e., it is defined to be nonzero only on the region $\mathcal{R}_i$, but on this region it is a scaled version
of the original pdf for Xt. If we prefer to work instead with the univariate output densities
$\{g_i(\cdot)\}_{i=1}^{L}$, we can derive the optimal choices for these from their multivariate counterparts
$\{f_i(\cdot)\}_{i=1}^{L}$ defined above. In particular, the optimal output densities are given by

$$g_i(y) = \frac{1}{P(i)} \int_{(y,\, y_{t-1:t-K+1}) \in \mathcal{R}_i} f_{Y_{t:t-K+1}}(y, y_{t-1:t-K+1})\, dy_{t-1:t-K+1}. \tag{3.112}$$

With regard to these univariate densities, we note that there is a significant difference
between the case in which K > 1 and the previously considered case in which K = 1.
Specifically, in contrast to the first-order case, the collection of functions $\{g_i(\cdot)\}_{i=1}^{L}$ may now
have overlapping regions of support on the real line, despite the fact that the original regions
in state space are disjoint. For example, in Figure 3-10 we can see that the projections of
Region 1 and Region 2 onto the real line will certainly not be disjoint.

The above solutions for P(i) and $f_i(\cdot)$ (or, equivalently, for P(i) and $g_i(\cdot)$) are straightforward
extensions of the results we obtained for the case K = 1. However, the solution for
the critical state transition probability Q(i, j) is somewhat less direct. Recall that Q(i, j)
can be expressed as

$$Q(i,j) = \frac{R(i,j)}{P(i)}, \tag{3.113}$$

where $R(i,j) = \Pr\{\Theta_t = i,\, \Theta_{t+1} = j\}$. It therefore suffices to find an expression for R(i, j).
We know from our earlier analysis that, in terms of the dynamics of the original process,
the optimal value for the quantity R(i, j) is simply the joint probability that the following
two events will occur:

(i) $X_t \in \mathcal{R}_i$

(ii) $X_{t+1} \in \mathcal{R}_j$

To get a concise expression for this joint probability, let us first go back to the definitions
of Xt and Xt+1, which are given by

\begin{align}
X_t &= (Y_t, Y_{t-1}, \ldots, Y_{t-K+1}) \tag{3.114} \\
X_{t+1} &= (Y_{t+1}, Y_t, \ldots, Y_{t-K+2}). \tag{3.115}
\end{align}

Now, if we insist on the condition $X_t \in \mathcal{R}_i$, but place no constraints on the subsequent
sample in the process, Yt+1, then for the augmented state vector (Yt+1, Xt) we have that

$$(Y_{t+1}, X_t) \in \mathbb{R} \times \mathcal{R}_i. \tag{3.116}$$

Similarly, if we impose the restriction $X_{t+1} \in \mathcal{R}_j$, but place no constraints on the preceding
sample in the process, Yt-K+1, then for the augmented vector (Xt+1, Yt-K+1) we have that

$$(X_{t+1}, Y_{t-K+1}) \in \mathcal{R}_j \times \mathbb{R}. \tag{3.117}$$

But observe that since

$$(Y_{t+1}, X_t) = (X_{t+1}, Y_{t-K+1}) = (Y_{t+1}, Y_t, \ldots, Y_{t-K+1}), \tag{3.118}$$

we can immediately combine (3.116) and (3.117) to obtain the consolidated condition

$$(Y_{t+1}, Y_t, \ldots, Y_{t-K+1}) \in \mathcal{R}_{ij}, \tag{3.119}$$

where $\mathcal{R}_{ij}$ denotes the region in $\mathbb{R}^{K+1}$ given by the set intersection

$$\mathcal{R}_{ij} = (\mathbb{R} \times \mathcal{R}_i) \cap (\mathcal{R}_j \times \mathbb{R}). \tag{3.120}$$

From this last result, we finally have that

$$R(i,j) = \int_{\mathbf{y} \in \mathcal{R}_{ij}} f_{Y_{t:t+K}}(\mathbf{y})\, d\mathbf{y}. \tag{3.121}$$

This is the appropriate extension of our result from the first-order case, in which we had

$$R(i,j) = \int_{\mathbf{y} \in [d_{i-1}, d_i] \times [d_{j-1}, d_j]} f_{Y_{t:t+1}}(\mathbf{y})\, d\mathbf{y}. \tag{3.122}$$
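
To make the role of the regions in (3.110), (3.113), and (3.121) concrete, the sketch below estimates P(i) and Q(i,j) for a second-order process by counting region visits along a long simulated realization. Several things here are assumptions chosen only for illustration: the AR coefficients (0.5, 0.3), the partition of the plane into four quadrants, the hypothetical helper names, and the substitution of Monte Carlo counting for the integrals in the text.

```python
import random

def region_of(x):
    """Hypothetical partition of R^2 into four quadrant regions, indexed 0..3.

    x is the state vector (Y_t, Y_{t-1}).
    """
    y_t, y_prev = x
    return (0 if y_t >= 0.0 else 1) + (0 if y_prev >= 0.0 else 2)

def estimate_hmm_params(coeffs, region_fn, num_regions, n_samples=20000, burn_in=1000, seed=0):
    """Estimate P(i), Q(i,j), and R(i,j) by counting successive region visits."""
    rng = random.Random(seed)
    x = [0.0] * len(coeffs)                       # state vector (Y_t, ..., Y_{t-K+1})
    counts = [[0] * num_regions for _ in range(num_regions)]
    prev_region = None
    for t in range(burn_in + n_samples):
        y_new = sum(c * v for c, v in zip(coeffs, x)) + rng.gauss(0.0, 1.0)
        x = [y_new] + x[:-1]                      # shift in the newest sample
        region = region_fn(x)
        if t >= burn_in and prev_region is not None:
            counts[prev_region][region] += 1      # event {X_t in R_i, X_{t+1} in R_j}
        prev_region = region
    total = sum(map(sum, counts))
    R = [[c / total for c in row] for row in counts]
    P = [sum(row) for row in R]
    Q = [[R[i][j] / P[i] if P[i] > 0 else 0.0 for j in range(num_regions)]
         for i in range(num_regions)]
    return P, Q, R

P2, Q2, R2 = estimate_hmm_params([0.5, 0.3], region_of, 4)
```

Because Xt and Xt+1 share the sample Yt, some of the regions defined by (3.120) are empty: with the quadrant partition, a state with Yt >= 0 can never be followed by a state whose second coordinate is negative, so the corresponding entries of Q are exactly zero.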

The  issues  that  arise  in  the  approximation  of  higher-order  autoregressive  processes  will 
also  be  encountered  in  Chapter  4.  There,  we  address  the  related  modeling  problem  of 
finding optimal HMM parameter values under the assumption that we are given a finite
number  of  observations  of  the  true  AR  random  process,  rather  than  the  actual  pdf  of  the 
process.  Since  the  true  AR  process  may  have  order  K  >  1,  obtaining  a  solution  will  require 
practical  methods  for  representing  irregularly  shaped  regions  in  a  high-dimensional  space, 
as  well  as  methods  for  estimating  densities  and  integrals  of  densities  over  such  regions. 


3.6  Discussion 

3.6.1  Necessity  of  Constraints  for  Finite-State  Approximation 

In the formulation of the signal approximation problem stated at the beginning of the chapter,
we imposed a number of rather stringent constraints on our finite-state signal model.
Here  we  attempt  to  explain  why  these  constraints  were  needed  to  make  the  approximation 
problem  mathematically  tractable. 

First,  recall  that  we  made  extensive  use  of  the  state-space  partitioning  constraint,  the 
purpose  of  which  was  to  enforce  a  mapping  between  the  L  disjoint  regions  making  up  a 
partition  of  the  original  state  space  and  the  L  states  of  the  signal  model.  To  maintain 
logical  consistency  with  this  constraint,  we  also  defined  the  support  set  for  the  output 
pdf  associated  with  each  state  to  be  precisely  the  same  as  the  region  assigned  to  that  state; 
hence,  the  support  sets  themselves  were  not  allowed  to  overlap.  When  taken  together,  these 
constraints  enabled  us  to  infer  the  state  of  the  Markov  chain  unambiguously  from  the  HMM 
output at each time step, and they therefore greatly simplified the resulting optimization
problem. It is important to note that if these constraints had not been imposed on the
model,  the  optimization  problem  might  have  been  intractable.  If,  for  example,  we  had 
allowed  the  support  sets  for  the  HMM  output  densities  to  overlap,  then  in  order  to  calculate 
the  probability  of  a  given  output  sequence,  we  would  have  had  to  account  for  every  possible 
underlying  state  sequence  of  the  Markov  chain  that  could  have  produced  this  output.  This 
may  well  have  consumed  a  prohibitive  amount  of  computation,  since  the  number  of  such 
state  sequences  grows  exponentially  with  the  observation  length. 

We also imposed the constraint that the underlying dynamics of our finite-state approximation
were to be structured in the form of a Markov chain. While this constraint
was intuitively appealing since the true signal was in fact known to be Markov, it too was
introduced to make the optimization problem tractable. Recall that we had already defined
a quantization function, $\theta(\cdot)$, to effect the mapping between the original state space
associated with the true signal and the finite state set associated with the approximation.
Therefore, in the absence of the Markov-chain constraint, the most direct method of creating
a finite-state approximation of the true state vector sequence {Xt} would simply have been
to apply this quantization function to each vector in the sequence. This would have created
an entirely new random process $\{\theta(X_t)\}$ in which each element assumed values on the set
{1, 2, ..., L}; however, this new process would not necessarily possess the Markov property.
Thus,  when  evaluating  the  probability  of  a  given  finite-state  sequence  under  this  approach, 
we  would  not  have  realized  the  computational  savings  made  possible  by  the  Markov-chain 
constraint. 

3.6.2  Alternative  Criteria  for  Assessing  Approximation  Quality 

The criteria we proposed earlier in the chapter for determining the quality of an approximation
(i.e., the maximin probability-of-error criterion and the Kullback-Leibler distance
metric) were based on the principle that we should match the shapes of the true and
approximate signal densities in an appropriate way. Such criteria are useful from the standpoint
of general-purpose signal representation; that is, the criteria themselves are actually
independent  of  any  specific  signal  processing  task,  but  nonetheless  allow  us  to  develop  a 
signal  model  that  is  likely  to  perform  reasonably  well  at  most  tasks,  provided  that  the  order 
of  the  model  is  sufficiently  high. 

Sometimes, however, it may be desirable to use other, more task-specific ways of assessing
approximation quality. In the context of a well defined signal processing task, the
most logical criteria for determining what makes the best signal approximation are usually
obvious once we know how the approximation is going to be used in performing the task.
Consider, for example, the task of signal estimation, in which the objective is to produce
an estimate of the source signal {Yt} that has been corrupted by an independent additive
noise process {Vt}. Suppose that the noise has a simple statistical description (e.g., it is
white and Gaussian) and that $\{\tilde{Y}_t\}$ is our HMM-based representation of {Yt}. Furthermore,
suppose that, under the structural constraints imposed by this HMM assumption, an optimal
processor for the signal estimation problem can easily be derived. Let us denote by
$\hat{Y}_t(Z; \Psi)$ the value generated by such a processor at time t, where Z is our finite-length
corrupted observation and $\Psi$ is a multi-dimensional tuple representing all of the parameters
needed to completely specify the HMM for $\{\tilde{Y}_t\}$. Then, if the signal estimate we desire is
an MMSE estimate, we should select the optimal tuple of HMM parameters according to
the rule

$$\Psi^{*} = \arg\min_{\Psi \in \mathcal{V}} E\big\{(\hat{Y}_t(Z; \Psi) - Y_t)^2\big\}, \tag{3.123}$$

where $\mathcal{V}$ is the collection of all tuples that yield a valid HMM-based signal approximation.
Once we know the value of the optimal tuple, we would then have a precise characterization
of the best approximate signal $\{\tilde{Y}_t\}$.

If  an  approximation  can  be  obtained  by  solving  (3.123),  its  performance  in  the  signal 
estimation  problem  is  guaranteed  to  be  at  least  as  good  as  the  performance  achieved  by 
any  other  approximation  in  V,  including  an  approximation  which  is  optimal  under  the 
Kullback-Leibler  measure.  However,  the  solution  to  (3.123)  may  not  perform  well  when 
applied  to  a  task  other  than  signal  estimation,  simply  because  any  two  distinct  signal 
processing  tasks  can  have  radically  different  objective  functions.  In  view  of  this  lack  of 
“task  robustness,”  we  may  require  a  number  of  different  signal  approximations,  one  for 
each  task  that  must  be  performed.  These  might  include,  for  example,  the  tasks  of  detection, 
classification,  enhancement,  or  compression.  A  task-dependent  approach  of  this  kind  may 
indeed  yield  better  overall  performance,  but  it  will  also  lead  to  a  considerable  increase  in 
complexity  in  the  design  of  a  signal  processing  system.  In  particular,  each  new  signal  model 
obtained  would  require  additional  storage,  and  all  of  the  models  together  would  have  to 
be  manipulated  by  a  higher-level,  centralized  decision  system,  whose  purpose  would  be  to 
determine  which  model  is  appropriate  for  the  particular  task  at  hand. 

3.6.3  Advantages  and  Limitations  of  Using  HMMs 

We  have  seen  that  a  hidden  Markov  model  can  be  used  to  approximate  a  real-world  signal 
concisely  by  capturing  the  dynamical  behavior  of  the  signal’s  most  important  statistical 
features;  consequently,  it  has  the  potential  of  greatly  simplifying  any  signal  analysis  that 
must  be  performed.  In  addition,  if  the  true  signal  is  known  to  be  stationary  and  can  be 
described  as  the  output  of  a  known  dynamical  system,  then  it  is  clear,  at  least  in  principle, 
that  an  HMM  could  be  designed  to  represent  the  signal  with  arbitrarily  high  accuracy 
(via the quantization approach described earlier) by partitioning the state space arbitrarily
finely. Another clear benefit of using HMMs is that their mathematical properties have
been  investigated  by  researchers  from  widely  varied  disciplines  over  a  period  of  more  than 
three  decades;  as  a  result,  such  models  are  now  very  well  understood.  Moreover,  a  number 
of  sophisticated,  computationally  efficient  algorithms  have  already  been  developed  for  many 
signal  processing  applications  involving  HMMs. 

On  the  other  hand,  the  use  of  HMMs  for  signal  approximation  also  has  a  number  of 
limitations.  These  limitations  must  also  be  taken  into  account,  since  the  set  of  signals 
that  we  might  attempt  to  approximate  with  HMMs  is  virtually  limitless.  We  begin  by 
observing  that  an  HMM  is  capable  of  modeling  temporal  dependencies  in  the  true  signal 
only  to  the  extent  allowed  by  its  coarse,  finite-state  dynamical  structure,  i.e.,  its  underlying 
Markov  chain.  Among  all  conceivable  dynamical  models,  the  Markov  chain  possesses  one 
of the simplest memory structures available, second only to that of a process that is
entirely memoryless. In contrast, the signal that is being approximated might be extremely
complex, and it might not even obey the Markov property. Thus, perhaps the most serious
limitation  of  HMMs  is  that  the  probabilistic  structure  of  an  HMM  may  be  too  simplistic  to 
capture  the  intricate  detail  in  the  true  signal  —  detail  that  might  be  critical  in  successfully 
performing  the  signal  processing  task  for  which  the  HMM  will  be  used. 

There  are  undoubtedly  many  cases  in  which  an  HMM-based  approximation  would  be 
suitable,  provided  that  we  allow  a  sufficient  number  of  states  in  its  Markov  chain  and  a 
sufficient  number  of  degrees  of  freedom  to  describe  each  of  its  output  densities.  In  such  cases, 
however,  we  may  discover  another  of  the  potentially  serious  limitations  of  using  HMMs, 
namely  that  the  specification  of  an  HMM  could  require  solving  for  an  enormous  number  of 
parameters.  For  example,  it  is  not  inconceivable  that  an  HMM-based  representation  of  a 
given  signal  may  require  as  many  as  50  states,  and  that  specifying  the  output  pdf  in  each 
state  may  require  as  many  as  10  parameters.  This  means  that  we  would  need  50  initial 
probabilities,  2500  state  transition  probabilities,  and  500  pdf  parameters  in  order  to  specify 
the  HMM  completely.  Even  if  we  could  find  optimal  values  for  more  than  3000  parameters, 
the  resulting  model  would  be  considered  impractical  in  many  signal  processing  situations. 
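
As a quick check of the parameter count in this example, the three groups of parameters can be tallied directly. The symbols $L$ (number of states) and $q$ (number of pdf parameters per state) are introduced here only for illustration, and the tally counts every probability as a separate parameter without subtracting any normalization constraints.

```latex
\[
  \underbrace{L}_{\text{initial probabilities}}
  + \underbrace{L^2}_{\text{transition probabilities}}
  + \underbrace{L\,q}_{\text{pdf parameters}}
  = 50 + 50^2 + 50 \cdot 10
  = 50 + 2500 + 500
  = 3050 .
\]
```

The quadratic term dominates, so under this tally the number of states, rather than the complexity of the output densities, is what drives the size of the model.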

Fortunately,  as  we  will  discover  in  Chapter  5,  it  is  not  always  necessary  or  desirable  to 
incorporate  a  large  number  of  parameters  into  an  HMM  simply  so  that  we  can  represent 
the  true  signal  at  its  finest  level  of  statistical  detail.  Quite  the  contrary,  in  certain  signal 
processing  applications,  a  rather  coarse  HMM-based  approximation  of  the  true  signal  is 
adequate  to  achieve  nearly  optimal  performance. 


Chapter  4 

Building Finite-State Markov Models from Observations


4.1  Introduction 

In  Chapter  3,  we  considered  the  problem  of  how  to  find  the  best  HMM-based  representation 
of  a  stationary  random  signal  given  exact  knowledge  of  the  signal  pdf.  This  was  a  useful 
starting  point  for  our  analysis  of  HMMs  because  it  compelled  us  to  consider  how  finite-state 
modeling  should  be  performed  when  complete  information  about  the  true  signal  is  available. 
In  a  real-world  signal  processing  situation,  however,  we  rarely  have  such  a  large  amount  of 
prior  knowledge  about  any  of  the  signals  that  make  up  the  measurement.  Thus,  although 
our  analysis  from  the  previous  chapter  produced  a  number  of  useful  theoretical  guidelines 
for  finite-state  modeling,  certain  assumptions  associated  with  the  approximation  problem 
addressed  there  were  somewhat  unrealistic. 

In  this  chapter,  we  adopt  a  more  practical  viewpoint  and  consider  the  problem  of  how 
to  construct  an  HMM-based  representation  of  the  true  random  process  when  we  have  only 
a  finite-length  observation  of  this  process  available,  rather  than  a  complete  probabilistic 
description  of  it.  Thus,  the  problem  we  consider  here  is  essentially  one  of  HMM  source 
identification.  In  the  following  two  subsections,  we  give  an  outline  of  the  assumptions  and 
notation  that  will  be  used  in  connection  with  the  HMM  source  identification  problem,  and 
we  provide  a  concise  formulation  of  the  problem  itself.  In  the  third  subsection,  we  describe 
how  the  remaining  material  in  the  chapter  is  organized. 

4.1.1  Preliminary  Assumptions  and  Notation 

In  the  latter  part  of  Chapter  3,  we  discussed  some  of  the  complexities  involved  in  the  finite- 
state  modeling  of  AR  processes  having  order  greater  than  one.  Clearly,  we  must  address  the 
issues  raised  in  that  discussion  before  developing  our  source  identification  algorithm.  We 
pointed  out,  for  example,  that  arbitrary  multi-dimensional  regions  in  a  state-space  partition 
cannot  be  easily  represented  or  manipulated  with  finite  memory  and  computing  resources. 
In addition, the optimal output densities associated with the states of the HMM (which
were expressed directly in terms of the true signal pdf in Chapter 3) are free of restrictions,
and hence may also require a large amount of storage to represent accurately.

Figure 4-1: Depiction of a typical Voronoi partition in two-dimensional space.

A  further  complication  is  that,  in  addition  to  addressing  the  issue  of  computational 
complexity,  we  must  also  deal  with  the  issue  of  uncertainty,  since  the  exact  signal  pdf  is 
unknown.  In  particular,  observe  that  quantities  such  as  the  optimal  initial  state  probabilities 
and  state  transition  probabilities  of  the  Markov  chain,  as  well  as  the  mutual  information 
between  successive  state  variables  in  the  chain,  are  all  unambiguously  defined  when  the 
pdf  of  the  true  random  process  is  given.  In  the  present  case,  however,  these  quantities  will 
have  to  be  estimated  using  only  a  finite-length  signal  realization.  In  the  remainder  of  this 
subsection,  we  describe  the  techniques  that  will  be  used  to  address  these  issues. 

We will represent regions in $\mathbb{R}^K$ efficiently using a simple geometric construction known
as a Voronoi partition [152]. A Voronoi partition having $L$ regions can be completely
characterized by $L$ distinct points in the space. Let us denote these points by $c_1, c_2, \ldots, c_L$
and their corresponding regions by $\mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_L$. The region $\mathcal{R}_j$ is defined by

\[ \mathcal{R}_j = \{ x \in \mathbb{R}^K \mid D(x, c_j) \le D(x, c_i), \; i = 1, 2, \ldots, L \}, \tag{4.1} \]

where the notation $D(x, c)$ represents the Euclidean distance between the points $x$ and $c$.
In other words, the set $\mathcal{R}_j$ contains all points in $\mathbb{R}^K$ that are closer to the point $c_j$ than to
any other point $c_i$, $i \ne j$. We will occasionally refer to the region $\mathcal{R}_i$ as a Voronoi region
and to its associated point $c_i$ as the anchor point for the region. An example of a randomly
generated Voronoi partition in two-dimensional space is shown in Figure 4-1. In view of
the simple rule given in (4.1), we see that a major advantage of using this construction
is savings in memory and computation; at the end of the chapter, we will discuss several
limitations associated with using this type of partition.

At certain stages of our source identification algorithm, we will need to know which
region within the current Voronoi partition contains the data point $x_t$. For this purpose it
is convenient to introduce a class label for the $t$th data point, which we denote by $\omega_t$. This
class label will take exactly one of the values in the set $\{1, 2, \ldots, L\}$. Once a set of anchor
points $\{c_1, c_2, \ldots, c_L\}$ has been fixed, the value of $\omega_t$ is defined according to the formula

\[ \omega_t = \arg\min_{j \in \{1, 2, \ldots, L\}} D(c_j, x_t). \tag{4.2} \]

The tuple consisting of all such class labels will be denoted by

\[ \Omega = (\omega_0, \omega_1, \ldots, \omega_{N-1}) \tag{4.3} \]

and will be referred to as the classification sequence.
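
The labeling rule in (4.2) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in this work: the function name `classify`, the tuple representation of points, and the tie-breaking rule (lowest index wins) are assumptions made for the example.

```python
import math

def classify(points, anchors):
    """Return the classification sequence: for each point, the 1-based index
    of the nearest anchor under Euclidean distance, following (4.2)."""
    labels = []
    for x in points:
        # min() returns the first minimizer, so ties go to the lowest index
        # (a tie-breaking choice assumed here, not specified by the text).
        nearest = min(range(len(anchors)), key=lambda j: math.dist(anchors[j], x))
        labels.append(nearest + 1)
    return labels

# Three hypothetical anchor points and four data points in the plane.
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
points = [(0.5, 0.2), (3.1, 0.4), (0.3, 3.5), (2.5, 0.0)]
omega = classify(points, anchors)
print(omega)
```

Only the $L$ anchor points need to be stored to label any point, which is the memory saving that motivates the Voronoi construction.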


Suppose  we  have  fixed  a  set  of  anchor  points  and  have  performed  the  categorization  by 
region  described  above.  In  order  to  assess  the  quality  of  the  partition  represented  by  this 
set  of  anchor  points,  we  must  then  estimate  the  value  of  mutual  information  associated  with 
this  categorization.  Recall  from  Chapter  3  that  for  this  calculation  we  require  values  for 
both  the  joint  and  marginal  pmfs  for  the  state  variables  of  the  underlying  Markov  chain. 
We  will  use  empirical  estimates  for  these  pmfs  given  by 

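\[ P(j; \Omega) = \frac{1}{N} \sum_{t=0}^{N-1} \gamma_j(\omega_t), \qquad j = 1, 2, \ldots, L \tag{4.4} \]

and

\[ R(i, j; \Omega) = \frac{1}{N-1} \sum_{t=1}^{N-1} \gamma_{ij}(\omega_{t-1}, \omega_t), \qquad i, j = 1, 2, \ldots, L, \tag{4.5} \]

where $\gamma_j(\cdot)$ and $\gamma_{ij}(\cdot, \cdot)$ are binary-valued indicator functions defined respectively by

\[ \gamma_j(\omega) = \begin{cases} 1 & \text{if } \omega = j \\ 0 & \text{otherwise} \end{cases} \tag{4.6} \]

and

\[ \gamma_{ij}(\omega_1, \omega_2) = \begin{cases} 1 & \text{if } (\omega_1, \omega_2) = (i, j) \\ 0 & \text{otherwise.} \end{cases} \tag{4.7} \]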

Observe from (4.4) and (4.5) that we have expressed the pmf estimates with an explicit
dependence on the classification sequence $\Omega$. These pmf estimates measure, respectively,
the fraction of times that the symbol $j$ occurs in this classification sequence and the fraction
of transitions of the form $(i, j)$ that occur in the sequence. Using these pmf estimates, we
can compute the associated estimate of the mutual information given by

\[ I(\Omega) = \sum_{i=1}^{L} \sum_{j=1}^{L} R(i, j; \Omega) \log \frac{R(i, j; \Omega)}{P(i; \Omega) \, P(j; \Omega)}. \tag{4.8} \]

Because our goal is to find the classification sequence $\Omega$ that maximizes this measure of
mutual information, we will often refer to the function $I(\cdot)$ as the objective function in the
remainder of the chapter.
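
The estimates (4.4), (4.5), and (4.8) translate directly into code. The following sketch assumes labels in $\{1, \ldots, L\}$ stored in a Python list, uses natural logarithms, and treats terms with zero joint probability as contributing nothing to the sum; the function names are illustrative only.

```python
import math

def empirical_pmfs(labels, L):
    """Marginal pmf P, as in (4.4), and joint pmf R, as in (4.5),
    estimated from a classification sequence with values in 1..L."""
    N = len(labels)
    P = [0.0] * L
    R = [[0.0] * L for _ in range(L)]
    for w in labels:
        P[w - 1] += 1.0 / N
    # Each of the N-1 successive pairs contributes 1/(N-1) to one cell of R.
    for prev, cur in zip(labels, labels[1:]):
        R[prev - 1][cur - 1] += 1.0 / (N - 1)
    return P, R

def mutual_information(P, R):
    """Estimated mutual information (4.8); zero-probability terms are skipped."""
    total = 0.0
    for i, row in enumerate(R):
        for j, r in enumerate(row):
            if r > 0.0:
                total += r * math.log(r / (P[i] * P[j]))
    return total

labels = [1, 2, 1, 2, 1, 2]      # a strictly alternating sequence
P, R = empirical_pmfs(labels, 2)
I_hat = mutual_information(P, R)
print(P, R, I_hat)
```

An alternating sequence such as the one above is highly predictable from one step to the next and therefore scores well, whereas a sequence that never leaves a single state yields an estimate of zero.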

Finally,  we  will  assume  for  convenience  that  the  output  pdf  associated  with  the  ith  state 
of  our  HMM  is  a  Gaussian  mixture  having  Mi  constituent  elements,  and  is  defined  by 

\[ g_i(y) = \sum_{l=1}^{M_i} \rho_{il} \, \mathcal{N}(y; \mu_{il}, \sigma_{il}), \qquad -\infty < y < \infty. \tag{4.9} \]


4.1.2  Problem  Statement  and  Approach  to  Solution 

We  will  assume  that  the  autoregressive  order  of  the  signal,  K,  the  number  of  states  in  the 
HMM-based  signal  approximation,  L,  and  the  number  of  components  in  each  Gaussian- 
mixture  output  pdf,  Mi,  are  precisely  known.  In  addition,  we  assume  that  we  have  a 
finite-length  realization  of  the  source  signal  given  by 

\[ y_{-K+1:N-1} = (y_{-K+1}, y_{-K+2}, \ldots, y_{N-1}). \tag{4.10} \]

Under the same state vector definition used in Chapter 3, i.e., $X_t = (Y_t, Y_{t-1}, \ldots, Y_{t-K+1})$,
the assumption that we have the sequence of signal values above is equivalent to the assumption
that we have the sequence of state vector values given by

\[ x_{0:N-1} = (x_0, x_1, \ldots, x_{N-1}), \tag{4.11} \]

since $y_{-K+1:N-1}$ could be reconstructed perfectly from $x_{0:N-1}$ and vice versa. From this
sequence of $N$ data points in $K$-dimensional space, we wish to estimate the parameter
values  of  the  best  HMM-based  representation  of  the  source  signal;  these  include  the  values 
of  the  initial  state  probabilities  and  state  transition  probabilities  of  the  underlying  Markov 
chain,  as  well  as  the  means,  standard  deviations,  and  weighting  coefficients  of  the  Gaussian- 
mixture  densities  associated  with  the  states  of  the  chain.  It  is  understood  that  the  HMM  to 
be estimated must satisfy both the stationarity constraint and the state-space partitioning
constraint  described  in  Chapter  3. 
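
The equivalence between the signal realization (4.10) and the state-vector sequence (4.11) can be made concrete with a short sketch. It assumes the signal is held in a Python list whose first element is $y_{-K+1}$; the function name is illustrative.

```python
def state_vectors(y, K):
    """Build x_t = (y_t, y_{t-1}, ..., y_{t-K+1}) for t = 0, ..., N-1 from a
    list holding y_{-K+1}, ..., y_{N-1}, so that len(y) == N + K - 1."""
    N = len(y) - K + 1
    offset = K - 1                      # list index at which y_0 is stored
    return [tuple(y[offset + t - k] for k in range(K)) for t in range(N)]

# Example with K = 2: the list holds y_{-1}, y_0, y_1, y_2, y_3.
y = [0.1, 0.4, 0.9, 1.6, 2.5]
x = state_vectors(y, K=2)
print(x)
```

Going the other way, the original samples can be read back from the last vector component of `x[0]` through its first, followed by the first component of each later vector, which is why nothing is lost in passing from (4.10) to (4.11).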

Our  approach  to  finding  the  best  HMM-based  representation  of  the  source  signal  will 
be  to  construct  an  iterative,  ad  hoc  algorithm  which  implements  the  theoretical  guidelines 
established  in  Chapter  3.  We  decompose  the  algorithm  into  two  basic  parts:  (i)  estimation  of 
the  optimal  Voronoi  partition  of  the  state  space;  and  (ii)  estimation  of  the  HMM  parameters 
based  on  this  optimal  partition.  To  solve  the  first  of  these  two  subproblems,  we  develop  an 
algorithm  which  selects  an  appropriate  initial  state-space  partition  and  then  systematically 
adjusts  the  boundaries  of  this  partition  to  increase  the  value  of  the  objective  function  (i.e., 
the mutual information between successive state variables of the Markov chain) at each
step.  Once  the  best  partition  has  been  reached,  the  state-vector  realizations  contained  in 
each  region  of  the  partition  are  used  to  generate  empirical  estimates  of  the  HMM  parameter 
values. 

4.1.3  Chapter  Organization 

The chapter consists mainly of a description of the components of our HMM source identification
algorithm. First, we describe a procedure for finding the best partition of the
state  space  associated  with  the  observed  signal;  then,  we  describe  how  this  partition  can 
be  used  to  estimate  the  parameters  of  the  Markov  chain  as  well  as  the  parameters  of  the 
densities  associated  with  the  states  of  the  chain.  At  the  end  of  the  chapter,  we  discuss  a 
number  of  advantages  and  limitations  of  our  algorithm,  and  we  identify  several  open  issues 
as  potential  directions  for  future  work. 


4.2  Estimation  of  the  Optimal  State-Space  Partition 

The  purpose  of  the  first  part  of  our  HMM  source  identification  algorithm  is  to  estimate 
the  best  Voronoi  partition  of  the  state  space  based  on  the  given  sequence  of  state-vector 
realizations $x_{0:N-1}$. This part of the algorithm is made up of three stages: (i) selection of a
suitable initial Voronoi partition of the state space, (ii) iterative refinement of the Voronoi
partition, and (iii) termination of the iterative procedure. We describe each of these stages
of the algorithm in the following three subsections. In Figure 4-2, we show plots of the
output  produced  by  this  part  of  the  algorithm  in  a  particular  source  identification  problem. 
For  the  case  depicted  here,  the  true  AR  signal  has  order  two,  and  therefore  the  state  space 
is  two-dimensional. 

4.2.1  Selection  of  an  Initial  Partition 

We specify an initial Voronoi partition of our $K$-dimensional state space by choosing values
for the $L$ anchor points $c_1, c_2, \ldots, c_L$. To simplify the selection process, we restrict the
anchor points to lie in the given set of $N$ state vector realizations $\{x_0, x_1, \ldots, x_{N-1}\}$.
In particular, we use a randomized procedure whereby we choose each of the $L$ anchor
points successively from the set of $N$ realizations, without replacement, assuming after each
selection that all remaining realizations are equally likely candidates. This procedure has
the advantage that it yields an approximate random sample from the true marginal pdf
of the state vector, provided that $N$ is large relative to $L$ and to the dependence length
associated with the original random signal $\{Y_t\}$.

Once the initial partition has been specified in this way, we can assess the quality of
the partition by computing its associated objective function value. We evaluate the objective
function by performing the following three steps: first, we categorize each state vector
realization according to its region number using the minimum distance formula in (4.2);
then, from the resulting classification sequence, $\Omega$, we compute the empirical estimates
$\{P(i; \Omega)\}$ and $\{R(i, j; \Omega)\}$ using (4.4) and (4.5); finally, we calculate the corresponding
estimate of mutual information using (4.8).

Figure 4-2: Voronoi partition at various stages of the iterative refinement algorithm: (a)
initial partition; (b) partition after 5 iterations; (c) partition after 10 iterations; (d) partition
after 20 iterations.

This process of selecting a set of anchor points and evaluating the corresponding partition
can be repeated to find a different set with a higher objective function value. In fact, we
could in principle find the best such partition by testing all distinct sets of $L$ anchor points
that could be drawn from the pool of $N$ data points. However, even for modest values of $L$
and $N$, the amount of computation required to find the best initial set using this technique
would be prohibitive. An attractive alternative to this exhaustive search is to choose a fixed
number of sets of anchor points at random (say $J_{\text{init}}$, where $J_{\text{init}} \ll \frac{N!}{L!\,(N-L)!}$), compute the
mutual information corresponding to each set, and retain the set that yields the largest
mutual information as the specification of the initial partition. A detailed step-by-step
description of this initialization procedure is given in Figure 4-3.¹


¹ In this figure and in the following two figures, we use several new notational symbols, which we define
here. First, we refer to the set of time indices associated with our observations as $T = \{0, 1, \ldots, N-1\}$.
We also use the expression $S' = \mathrm{RS}(S; j)$ to indicate that $S'$ is a randomly selected subset of $S$ consisting
of $j$ elements. Finally, we define $\delta_t$ to be an ordered tuple of length $N$ whose $t$th element is equal to 1 and
whose remaining elements are equal to 0.




[0] DESCRIPTION:
    Initialize loop counter and objective function value.
    OPERATION:
    $n \leftarrow 0$
    $I_{\max} \leftarrow 0$

[1] DESCRIPTION:
    Select $L$ points at random from the given set of $N$ and use these as anchor points
    for the next Voronoi partition to be tested.
    OPERATION:
    $\{t_1, t_2, \ldots, t_L\} \leftarrow \mathrm{RS}(T; L)$
    $c_j \leftarrow x_{t_j}, \quad j = 1, 2, \ldots, L$

[2] DESCRIPTION:
    Compute the class label for each data point by determining the identity of the nearest
    anchor point.
    OPERATION:
    $\omega_t \leftarrow \arg\min_{j \in \{1, 2, \ldots, L\}} D(c_j, x_t), \quad t \in T$
    $\Omega \leftarrow (\omega_0, \omega_1, \ldots, \omega_{N-1})$

[3] DESCRIPTION:
    Calculate empirical estimates of the marginal and joint probability mass functions
    associated with state variables of the underlying Markov chain.
    OPERATION:
    $P(j; \Omega) \leftarrow \frac{1}{N} \sum_{t=0}^{N-1} \gamma_j(\omega_t), \quad j = 1, 2, \ldots, L$
    $R(i, j; \Omega) \leftarrow \frac{1}{N-1} \sum_{t=1}^{N-1} \gamma_{ij}(\omega_{t-1}, \omega_t), \quad i, j = 1, 2, \ldots, L$

[4] DESCRIPTION:
    Compute the value of the objective function, which is the estimated mutual information
    between successive state variables of the Markov chain.
    OPERATION:
    $I \leftarrow \sum_{i=1}^{L} \sum_{j=1}^{L} R(i, j; \Omega) \log \frac{R(i, j; \Omega)}{P(i; \Omega) \, P(j; \Omega)}$

[5] DESCRIPTION:
    Check if the current value of the objective function is the largest value encountered
    thus far. If so, store this value as well as the classification sequence that produced
    it; if not, proceed to the next step.
    OPERATION:
    if $I > I_{\max}$ then
        $I_{\max} \leftarrow I$ and store $\Omega$ as the best classification sequence found so far
    goto [6]

[6] DESCRIPTION:
    Increment the loop counter and check if a sufficient number of subsets of the data
    have been tested. If so, proceed to the body of the algorithm; if not, test another
    subset.
    OPERATION:
    $n \leftarrow n + 1$
    if $n = J_{\text{init}}$ then
        goto ITERATIVE REFINEMENT PROCEDURE
    goto [1]

Figure 4-3: Description of algorithm for selecting an initial Voronoi partition.
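
The initialization procedure can be sketched as follows. This is a hedged illustration under stated assumptions, not the implementation used in this work: all function names are hypothetical, the best objective value starts at negative infinity (a choice made here so that the first trial is always retained), and the demonstration data come from an arbitrary second-order autoregression chosen only to produce points in the plane.

```python
import math
import random

def classify(points, anchors):
    """Nearest-anchor labels in 1..L, following (4.2)."""
    return [min(range(len(anchors)), key=lambda j: math.dist(anchors[j], x)) + 1
            for x in points]

def mutual_information(labels, L):
    """Estimate (4.8) from the empirical pmfs (4.4) and (4.5)."""
    N = len(labels)
    P = [0.0] * L
    R = {}
    for w in labels:
        P[w - 1] += 1.0 / N
    for pair in zip(labels, labels[1:]):
        R[pair] = R.get(pair, 0.0) + 1.0 / (N - 1)
    return sum(r * math.log(r / (P[i - 1] * P[j - 1])) for (i, j), r in R.items())

def select_initial_partition(points, L, J_init, rng):
    """Try J_init random anchor sets drawn from the data; keep the best one."""
    best_I, best_anchors, best_labels = -math.inf, None, None
    for _ in range(J_init):
        anchors = rng.sample(points, L)     # L realizations, without replacement
        labels = classify(points, anchors)
        I = mutual_information(labels, L)
        if I > best_I:
            best_I, best_anchors, best_labels = I, anchors, labels
    return best_anchors, best_labels, best_I

# Demonstration on synthetic data (coefficients are arbitrary assumptions).
rng = random.Random(0)
y = [0.0, 0.0]
for _ in range(200):
    y.append(0.9 * y[-1] - 0.5 * y[-2] + rng.gauss(0.0, 1.0))
points = [(y[t], y[t - 1]) for t in range(1, len(y))]
anchors, labels, I0 = select_initial_partition(points, L=4, J_init=20, rng=rng)
print(anchors, I0)
```

Each trial costs one pass over the data for labeling and one for the pmf estimates, so the total work grows linearly in the number of trials, in contrast to the combinatorial cost of testing every possible anchor set.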




4.2.2  Iterative Refinement of the Partition

To  improve  upon  the  Voronoi  partition  generated  by  the  initialization  procedure,  we  now 
make  a  series  of  small  adjustments  to  the  anchor  points  of  the  partition  in  such  a  way  that 
the  objective  function  value  increases  at  every  step.  This  stage  of  the  algorithm,  which  we 
refer  to  as  the  iterative  refinement  procedure,  is  described  in  detail  in  Figure  4-5. 

During the iterative refinement procedure, we remove our original restriction that the
anchor points be elements of the set of data points $\{x_0, x_1, \ldots, x_{N-1}\}$, so that the locations
of the anchor points will now be unconstrained. However, the anchor points will not be
adjusted directly during the refinement, since this could entail a computationally expensive
search in the vicinity of each anchor point within $\mathbb{R}^K$. Instead, at each iteration we will
adjust the anchor points indirectly by making small changes to the regions that they represent.
These changes to the regions will in turn be implemented by changing the class labels
of certain data points contained in each region.

Thus,  adjustments  to  the  partition  are  ultimately  made  by  reassigning  data  points  to 
different  regions.  Observe  that  several  alternative  class  labels  can  be  hypothesized  for  each 
data  point  in  order  to  find  the  best  label  for  this  point  under  the  current  partition.  (The 
definition  of  the  best  class  label  for  a  particular  point  will  be  given  later;  roughly,  it  is  the 
label  which  is  most  likely  to  yield  the  greatest  increase  to  the  objective  function  value  when 
the  labels  of  all  other  points  are  held  constant.)  To  eliminate  unnecessary  computation,  we 
make  use  of  two  key  principles  at  each  iteration.  First,  we  use  the  principle  that  it  is  more 
important  to  examine  data  points  that  are  near  boundaries  of  the  current  partition  than 
to  examine  those  that  are  far  away;  second,  for  each  point  that  is  examined,  we  use  the 
principle  that  it  is  more  important  to  test  class  labels  corresponding  to  nearby  anchor  points 
rather  than  those  of  distant  anchor  points.  Accordingly,  each  iteration  performed  within 
the  iterative  refinement  procedure  consists  of  the  following  sequence  of  steps:  (i)  finding 
data  points  near  boundaries;  (ii)  finding  anchor  points  in  the  vicinity  of  each  boundary 
point;  (iii)  testing  the  profitability  of  reassigning  each  data  point  to  a  nearby  anchor;  (iv) 
switching  class  labels  of  points  under  test,  if  appropriate;  and  (v)  defining  an  updated  set 
of  anchor  points  based  on  the  new  class  labels. 

The  first  step  of  each  iteration  is  therefore  to  separate  all  of  the  data  points  into  those 
that  are  interior  points  and  those  that  are  boundary  points.  For  each  data  point,  we  make 
this  determination  based  on  the  class  labels  of  the  m  other  points  closest  to  it.  (Here  we 
assume  m  is  an  algorithm  parameter  that  must  be  specified  in  advance.)  In  particular,  if 
all  m  of  the  neighboring  points  have  the  same  class  label  as  the  point  under  test,  then  this 
point  is  defined  to  be  an  interior  point.  Otherwise,  the  point  is  defined  to  be  a  boundary 
point.  We  show  examples  of  interior  and  boundary  points  in  Figure  4-4.  If,  after  this 
boundary  test,  a  given  data  point  has  been  classified  as  an  interior  point,  then  its  class 
label  will  remain  fixed  for  the  current  iteration.  However,  if  the  point  has  been  classified 
as  a  boundary  point,  then  at  least  one  alternative  class  label  will  be  tested.  The  set  of 
candidate  labels  for  this  point  will  be  precisely  those  of  its  m  neighboring  points;  hence, 
the  m  nearest  neighbors  not  only  indicate  whether  a  particular  point  is  a  boundary  point, 
they  also  give  us  a  convenient  way  of  determining  which  anchor  points  are  nearby. 
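
The boundary test described above can be sketched with a brute-force neighbor search. The function name and the dictionary return format are assumptions made for this illustration, and the exhaustive search over all pairs of points is chosen for clarity rather than speed.

```python
import math

def boundary_candidates(points, labels, m):
    """Map the index of every boundary point to the set of alternative class
    labels found among its m nearest neighbors; interior points are omitted."""
    result = {}
    for t, x in enumerate(points):
        others = [s for s in range(len(points)) if s != t]
        # Brute-force m-nearest-neighbor search over all other data points.
        nearest = sorted(others, key=lambda s: math.dist(points[s], x))[:m]
        alternatives = {labels[s] for s in nearest} - {labels[t]}
        if alternatives:        # at least one neighbor carries a different label
            result[t] = alternatives
    return result

# Six points on a line, split into two classes between the third and fourth.
points = [(0.0,), (1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
labels = [1, 1, 1, 2, 2, 2]
candidates = boundary_candidates(points, labels, m=2)
print(candidates)
```

In this small example only the two points adjacent to the class boundary are flagged, and each receives the label of the opposite class as its single candidate.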

Figure 4-4: Depiction of a portion of the Voronoi partition during an iteration of the
algorithm. In this case, the five nearest neighbors of a given point determine whether it is
an interior point, such as $x_1$, or a boundary point, such as $x_2$.

Suppose now that $x_t$ is known to be a boundary point. We wish to determine whether
the value of the objective function would increase or decrease as a result of changing its class
label $\omega_t$. We will make this determination under the assumption that the class labels of all
data points other than $x_t$ remain fixed. Under this assumption, the current classification
sequence

\[ \Omega = (\omega_0, \omega_1, \ldots, \omega_{t-1}, \omega_t, \omega_{t+1}, \ldots, \omega_{N-1}) \tag{4.12} \]

would change to the new classification sequence

\[ \Omega' = (\omega_0, \omega_1, \ldots, \omega_{t-1}, \omega'_t, \omega_{t+1}, \ldots, \omega_{N-1}), \tag{4.13} \]

where $\omega'_t$ represents a class label (different from $\omega_t$) of one of the $m$ nearest neighbors of
$x_t$. To calculate the effect of this change on the value of the objective function, we first
determine its effect on the marginal pmf $P(\cdot; \Omega)$ and the joint pmf $R(\cdot, \cdot; \Omega)$. Let us express
the modified pmfs $P(\cdot; \Omega')$ and $R(\cdot, \cdot; \Omega')$ as additively perturbed versions of the original
pmfs given by


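\[ P(i; \Omega') = P(i; \Omega) + \Delta P(i; \Omega, \Omega') \tag{4.14} \]

and

\[ R(i, j; \Omega') = R(i, j; \Omega) + \Delta R(i, j; \Omega, \Omega') \tag{4.15} \]

for $i, j = 1, 2, \ldots, L$. Then, relative to the original classification sequence, the new classification
sequence has one more element with the class label $\omega'_t$ and one fewer element with
class label $\omega_t$ as a result of the change; hence, the perturbation $\Delta P$ must have the form

\[ \Delta P(i; \Omega, \Omega') = \frac{1}{N} \gamma_i(\omega'_t) - \frac{1}{N} \gamma_i(\omega_t). \tag{4.16} \]

Moreover, since the new classification sequence now contains the transitions $(\omega_{t-1}, \omega'_t)$ and
$(\omega'_t, \omega_{t+1})$ but no longer contains the transitions $(\omega_{t-1}, \omega_t)$ and $(\omega_t, \omega_{t+1})$, we conclude that
the perturbation $\Delta R$ must have the form

\[ \Delta R(i, j; \Omega, \Omega') = \frac{1}{N-1} \gamma_{ij}(\omega_{t-1}, \omega'_t) + \frac{1}{N-1} \gamma_{ij}(\omega'_t, \omega_{t+1}) - \frac{1}{N-1} \gamma_{ij}(\omega_{t-1}, \omega_t) - \frac{1}{N-1} \gamma_{ij}(\omega_t, \omega_{t+1}). \tag{4.17} \]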

Working with these simple perturbations allows us to quickly assess the effect on the objective
function, since it essentially saves us from recalculating the marginal and joint pmfs
using the original formulas in (4.4) and (4.5). Once the above perturbations have been
applied via (4.14) and (4.15), the value of the objective function $I(\Omega')$ resulting from the
change in the classification sequence can now be computed using (4.8) with $\Omega$ replaced by
$\Omega'$. The net change in the objective function is given by

\[ \Delta I(\Omega, \Omega') = I(\Omega') - I(\Omega). \tag{4.18} \]


If,  for  a  particular  boundary  point,  there  is  only  a  single  alternative  class  label  to  be 
tested  via  the  above  marginal  measure,  and  if  this  new  label  would  yield  an  increase  in 
mutual  information  according  to  (4.18),  then  the  current  label  is  replaced  by  the  new  label; 
otherwise,  the  current  label  is  left  unchanged.  If  there  are  multiple  alternative  labels  to 
be  tested  in  this  way,  then  the  current  class  label  is  replaced  by  the  alternative  label  that 
yields  the  largest  change  in  the  objective  function  (provided  that  this  change  is  greater  than 
zero).  The  remaining  data  points  are  then  processed  in  a  similar  way.  In  the  descriptions 
given  in  Figure  4-5,  we  refer  to  boundary  points  whose  labels  could  be  changed  to  increase 
mutual  information  as  positive-potential  boundary  points. 
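
The label-replacement rule just described can be sketched as follows. The sketch assumes the same mutual-information objective as above (an assumption about the form of (4.8)), recomputes it from scratch for clarity instead of using the perturbation shortcut, and takes the neighbor lists as given. The names `objective` and `positive_potential_points` are hypothetical.

```python
import numpy as np

def objective(labels, L, eps=1e-12):
    """Assumed objective: mutual information between consecutive class labels."""
    labels = np.asarray(labels)
    N = labels.size
    P = np.bincount(labels, minlength=L) / N
    R = np.zeros((L, L))
    np.add.at(R, (labels[:-1], labels[1:]), 1.0)
    R /= (N - 1)
    mask = R > eps
    return float(np.sum(R[mask] * np.log(R[mask] / np.outer(P, P)[mask])))

def positive_potential_points(labels, neighbors, L):
    """Map each positive-potential boundary point to its best alternative label.

    neighbors[t] lists the indices of the m nearest neighbors of point t.
    """
    labels = np.asarray(labels)
    base = objective(labels, L)
    best = {}
    for t in range(labels.size):
        # Alternative labels are those carried by neighbors but not by the point itself.
        alternatives = set(labels[neighbors[t]].tolist()) - {int(labels[t])}
        gains = {}
        for a in alternatives:
            trial = labels.copy()
            trial[t] = a              # all other labels remain fixed
            gains[a] = objective(trial, L) - base
        if gains:
            a_best = max(gains, key=gains.get)
            if gains[a_best] > 0:     # keep the change only if the objective increases
                best[t] = a_best
    return best
```

Points whose neighbors all share their label have no alternatives and are skipped, so only boundary points are ever tested.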

The new class labels for the data points are only tentative, however. This is because the
points themselves do not necessarily fall into disjoint Voronoi regions under their present
categorization; their new labels merely indicate the general directions in which region
boundaries should be modified. Class labels become permanently reassigned through a two-step
procedure. First, the anchor points of a new Voronoi partition are defined based on the
tentative class labels; then, each data point is once again re-labeled according to its nearest
anchor point. To obtain the new value for the anchor point $\mathbf{c}_j$, we compute the centroid of
all points $\mathbf{x}_t$ having the tentative class label $\omega'_t = j$. That is, $\mathbf{c}_j$ is now given by

\[ \mathbf{c}_j = \frac{\sum_{t=0}^{N-1} \delta_j(\omega'_t)\, \mathbf{x}_t}{\sum_{t=0}^{N-1} \delta_j(\omega'_t)}. \tag{4.19} \]

Once these anchors have been computed, each new class label $\omega'_t$ is calculated using (4.2).
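
The two-step reassignment (centroid update followed by nearest-anchor relabeling) might look like the sketch below. It assumes Euclidean distance for the nearest-anchor rule in (4.2) and, as a further assumption not discussed in the text, leaves the anchor of an empty class where it was. The function names are hypothetical.

```python
import numpy as np

def update_anchors(points, labels, anchors):
    """Replace each anchor by the centroid of the points carrying its label."""
    new_anchors = anchors.copy()
    for j in range(anchors.shape[0]):
        members = points[labels == j]
        if len(members) > 0:          # assumption: an empty class keeps its old anchor
            new_anchors[j] = members.mean(axis=0)
    return new_anchors

def relabel(points, anchors):
    """Assign each point the index of its nearest anchor (squared Euclidean distance)."""
    d2 = ((points[:, None, :] - anchors[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

points = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0], [11.0, 0.0]])
tentative = np.array([0, 0, 0, 1])
anchors = update_anchors(points, tentative, np.zeros((2, 2)))
print(relabel(points, anchors))  # the third point moves to class 1
```

In this toy case the tentative labels place the third point in class 0, but after the centroids are recomputed it lies closer to the anchor of class 1, which illustrates why the tentative labels are not final.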

It  is  hoped  that  the  Voronoi  partition  resulting  from  this  change  will  have  a  greater 
objective  function  value  than  did  its  predecessor.  However,  this  is  not  always  the  case  for 
at  least  two  reasons.  First,  the  decision  to  change  a  class  label  is  made  individually  for 
each  boundary  point,  rather  than  jointly  over  all  boundary  points;  second,  the  two-stage 
construction  of  the  partition  at  the  end  of  the  iteration  may  introduce  some  error  through 
the approximation of boundary surfaces. (As we will demonstrate in the discussion at the
end of the chapter, the final partition boundary is always a portion of a K-dimensional
hyperplane under the Voronoi constraint, even though the most recent class label changes
may collectively indicate that the boundary should be curved in some way.) Nonetheless,
the  iterative  refinement  proceeds  as  described  above  until  the  value  of  the  objective  function 
undergoes  a  decrease,  rather  than  an  increase.  When  this  occurs,  the  termination  procedure 
is invoked.

ITERATIVE REFINEMENT PROCEDURE

[0] DESCRIPTION:

Assign initial values for the classification sequence and construct the set of m nearest
neighbors for each data point.

OPERATION:

\[
\begin{aligned}
\Omega &\leftarrow \Omega_{\max} \\
\tau_0(t) &\leftarrow t, && t \in \mathcal{T} \\
\mathcal{T}_i(t) &\leftarrow \mathcal{T} \setminus \{\tau_0(t), \ldots, \tau_{i-1}(t)\}, && t \in \mathcal{T};\ i = 1, 2, \ldots, m \\
\tau_i(t) &\leftarrow \arg\min_{\tau \in \mathcal{T}_i(t)} \{D(\mathbf{x}_\tau, \mathbf{x}_t)\}, && t \in \mathcal{T};\ i = 1, 2, \ldots, m
\end{aligned}
\]

[1] DESCRIPTION:

For each data point, construct the set of class labels associated with its m nearest
neighbors. Once this set has been determined, remove from it all class labels having
the same value as that of the point itself. Define the set of boundary points to be
those data points having at least one neighbor labeled differently from itself.

OPERATION:

\[
\begin{aligned}
\mathcal{W}_t &\leftarrow \{\omega_{\tau_1(t)}, \omega_{\tau_2(t)}, \ldots, \omega_{\tau_m(t)}\}, && t \in \mathcal{T} \\
\mathcal{W}_t &\leftarrow \{\omega \in \mathcal{W}_t \mid \omega \neq \omega_t\}, && t \in \mathcal{T} \\
\mathcal{T}_{\mathrm{bdy}} &\leftarrow \{t \in \mathcal{T} \mid \mathcal{W}_t \neq \emptyset\}
\end{aligned}
\]

[2] DESCRIPTION:

For each boundary point, test alternative classifications for it and determine which of
these would result in the greatest gain in the value of the objective function, assuming
that the labels of all other points remain fixed. Define the set of positive-potential
boundary points to be those that would increase the current value of the objective
function if given the best class label.

OPERATION:

\[
\begin{aligned}
\omega'_t &\leftarrow \arg\max_{\omega \in \mathcal{W}_t} \{\Delta I(\Omega, \Omega + (\omega - \omega_t)\delta_t)\}, && t \in \mathcal{T}_{\mathrm{bdy}} \\
\mathcal{T}^{+}_{\mathrm{bdy}} &\leftarrow \{t \in \mathcal{T}_{\mathrm{bdy}} \mid \Delta I(\Omega, \Omega + (\omega'_t - \omega_t)\delta_t) > 0\}
\end{aligned}
\]

[3] DESCRIPTION:

Define a new classification sequence by replacing the current class labels associated
with positive-potential boundary points with the best alternative class labels.

OPERATION:

\[ \Omega' \leftarrow \Omega + \sum_{t \in \mathcal{T}^{+}_{\mathrm{bdy}}} (\omega'_t - \omega_t)\,\delta_t \]

[4] DESCRIPTION:

For each set of data points with a given class label, compute the centroid location
and make this the new anchor point for the class.

OPERATION:

\[ \mathbf{c}_j \leftarrow \frac{\sum_{t=0}^{N-1} \delta_j(\omega'_t)\, \mathbf{x}_t}{\sum_{t=0}^{N-1} \delta_j(\omega'_t)}, \qquad j = 1, 2, \ldots, L \]

[5] DESCRIPTION:

Re-compute the class label for each data point by determining the identity of the
nearest anchor point.

OPERATION:

\[
\begin{aligned}
\omega'_t &\leftarrow \arg\min_{j \in \{1, 2, \ldots, L\}} \{D(\mathbf{c}_j, \mathbf{x}_t)\}, && t \in \mathcal{T} \\
\Omega' &\leftarrow (\omega'_0, \omega'_1, \ldots, \omega'_{N-1})
\end{aligned}
\]

[6] DESCRIPTION:

Check if the new classification sequence yields an overall increase in the value of
the objective function. If so, then update the current classification sequence accordingly
and perform another iteration; if not, proceed to the termination stage of the
algorithm.

OPERATION:

if $\Delta I(\Omega, \Omega') > 0$ then
    $\Omega \leftarrow \Omega'$
    goto [1]
else
    goto TERMINATION PROCEDURE
endif

Figure 4-5: Description of algorithm for iteratively refining the Voronoi partition.

4.2.3  Termination  of  the  Iterative  Refinement  Procedure

The  termination  procedure  is  designed  to  test  whether  the  most  recent  partition  refinement 
has yielded a local maximum or a global maximum of the objective function. This procedure
introduces random perturbations into the current partition as a way of probing the
parameter  space  and  testing  whether  the  objective  function  value  can  be  further  increased. 
If  a  particular  perturbation  is  successful,  the  termination  procedure  then  sends  control  back 
to  the  iterative  refinement  procedure  for  further  adjustments  to  the  partition.  Thus,  the 
termination  procedure  may  actually  be  invoked  more  than  once  during  a  single  execution 
of  the  overall  algorithm.  We  give  a  detailed  description  of  this  procedure  in  Figure  4-6. 

The  procedure  is  furnished  with  the  identities  of  the  positive-potential  boundary  points 
from  the  most  recent  partition  refinement,  as  well  as  the  corresponding  alternative  class 
labels  that  have  been  deemed  individually  the  most  profitable.  Clearly,  when  the  current 
labels  of  all  of  these  points  are  switched  to  their  best  alternatives  and  the  new  partition 
is  subsequently  specified,  the  objective  function  value  does  not  increase;  otherwise,  the 
termination  procedure  would  not  have  been  invoked.  Therefore,  rather  than  switching  the 
labels  of  all  of  these  points  at  once,  the  termination  procedure  switches  the  labels  of  only 
a  subset  of  these  points;  this  subset  is  chosen  at  random  from  all  possible  subsets  of  the 
positive-potential  boundary  points. 

If  switching  the  class  labels  of  the  particular  subset  selected  does  not  increase  the  value 
of  the  objective  function,  then  another  subset  is  tried.  This  procedure  continues  until  either 
the  objective  function  value  is  increased  or  the  maximum  allowable  number  of  subsets,  say 
Jterm,  is  reached.  (The  number  Jterm  is  an  algorithm  parameter  that  must  be  specified  in 
advance.)  If  the  maximum  number  of  subsets  has  been  tried  and  the  objective  function 
value  has  not  been  increased,  the  algorithm  terminates,  and  the  current  partition  (i.e.,  the 
one  supplied  to  the  termination  procedure  upon  invocation)  is  declared  to  be  the  best  one. 
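
The control flow of the termination procedure can be sketched as follows. The callables `evaluate` (objective value of a label sequence) and `rebuild` (re-specification of the partition from tentative labels) are hypothetical placeholders supplied by the caller, as is the function name itself. The subset is drawn by first choosing its size as a sum of fair coin flips and then sampling that many points, which makes every subset of the positive-potential points equally likely.

```python
import random

def termination_procedure(labels, best_alt, evaluate, rebuild, j_term, rng=None):
    """Try up to j_term random subsets of the positive-potential boundary points.

    best_alt maps each positive-potential point index to its best alternative label.
    Returns (labels, improved); improved=True means control should go back to the
    iterative refinement procedure with the returned labels.
    """
    rng = rng or random.Random()
    current = evaluate(labels)
    candidates = list(best_alt)
    for _ in range(j_term):
        k = sum(rng.randint(0, 1) for _ in candidates)  # random subset size
        subset = rng.sample(candidates, k)               # random subset of that size
        trial = list(labels)
        for t in subset:
            trial[t] = best_alt[t]
        trial = rebuild(trial)                           # re-specify the partition
        if evaluate(trial) > current:
            return trial, True
    return list(labels), False                           # current partition is kept

# Toy usage: the objective peaks when exactly one of the two candidates is switched.
new_labels, improved = termination_procedure(
    [0, 0, 0], {0: 1, 1: 1},
    evaluate=lambda seq: -abs(sum(seq) - 1),
    rebuild=lambda seq: seq,
    j_term=50, rng=random.Random(0))
print(new_labels, improved)
```

In the toy example, switching both candidates at once does not help, mirroring the situation that triggers the termination procedure, while switching either one alone does.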
[0] DESCRIPTION:

Initialize the loop counter.

OPERATION:

\[ n \leftarrow 0 \]

[1] DESCRIPTION:

Select at random a subset of the positive-potential boundary points.

OPERATION:

\[
\begin{aligned}
k_i &\leftarrow \mathrm{RS}(\{0, 1\}; 1), \qquad i = 1, 2, \ldots, |\mathcal{T}^{+}_{\mathrm{bdy}}| \\
k &\leftarrow \sum_{i=1}^{|\mathcal{T}^{+}_{\mathrm{bdy}}|} k_i \\
\tilde{\mathcal{T}}^{+}_{\mathrm{bdy}} &\leftarrow \mathrm{RS}(\mathcal{T}^{+}_{\mathrm{bdy}}; k)
\end{aligned}
\]

[2] DESCRIPTION:

Define a new classification sequence by replacing the current class labels associated
with the randomly selected subset of positive-potential boundary points with the best
alternative class labels.

OPERATION:

\[ \Omega' \leftarrow \Omega + \sum_{t \in \tilde{\mathcal{T}}^{+}_{\mathrm{bdy}}} (\omega'_t - \omega_t)\,\delta_t \]

[3] DESCRIPTION:

Re-compute the class label for each data point by determining the identity of the
nearest anchor point.

OPERATION:

\[
\begin{aligned}
\omega'_t &\leftarrow \arg\min_{j \in \{1, 2, \ldots, L\}} \{D(\mathbf{c}_j, \mathbf{x}_t)\}, && t \in \mathcal{T} \\
\Omega' &\leftarrow (\omega'_0, \omega'_1, \ldots, \omega'_{N-1})
\end{aligned}
\]

[4] DESCRIPTION:

Check if the new classification sequence yields an increase in the value of the objective
function. If so, then update the current classification sequence accordingly and return
to the iterative refinement procedure; if not, then increment the loop counter and
test another subset of positive-potential boundary points.

OPERATION:

if $\Delta I(\Omega, \Omega') > 0$ then
    $\Omega \leftarrow \Omega'$
    goto ITERATIVE REFINEMENT PROCEDURE
else
    $n \leftarrow n + 1$
    if $n = J_{\mathrm{term}}$ then
        end
    else
        goto [1]
    endif
endif

Figure 4-6: Description of algorithm to terminate the iterative refinement procedure.

4.3  Estimation  of  Optimal  HMM  Parameters


Once a suitable state-space partition has been found using the three-stage algorithm
described above, we can compute estimates of the HMM parameter values. In particular,
we  need  to  estimate  the  initial  state  probabilities  and  state  transition  probabilities  for  the 
Markov  chain  as  well  as  the  means,  standard  deviations,  and  weighting  coefficients  for  the 
Gaussian-mixture  densities  associated  with  the  states  of  the  chain.  Since  the  estimation  of 
Markov  chain  parameters  is  decoupled  from  the  estimation  of  output  density  parameters, 
the  algorithms  used  to  generate  these  estimates  can  be  executed  in  any  order. 

Let us denote by $\Omega^*$ the tuple of final class labels for the data points $\{\mathbf{x}_0, \ldots, \mathbf{x}_{N-1}\}$.
To estimate the Markov chain parameters, we use the empirical formulas given in (4.4)
and (4.5), as was done during the search for the best partition. However, an additional
processing step is now required to obtain the final estimates, namely the conversion of the joint
state probabilities $\{R(i, j; \Omega^*)\}_{i,j=1}^{L}$ to the state transition probabilities $\{Q(i, j; \Omega^*)\}_{i,j=1}^{L}$.
This step can be carried out via the formula

\[ Q(i, j; \Omega^*) = \frac{R(i, j; \Omega^*)}{\sum_{k=1}^{L} R(i, k; \Omega^*)}, \tag{4.20} \]

which normalizes the rows of the array $R$ such that each row becomes a valid pmf.
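
As a small illustration of (4.20), the row normalization can be written as below. The function name is hypothetical, and the handling of a row with no probability mass (left as zeros) is an assumption, since the text does not discuss that case.

```python
import numpy as np

def transition_matrix(R):
    """Convert joint state probabilities R(i, j) into transition probabilities Q(i, j)."""
    R = np.asarray(R, dtype=float)
    row_sums = R.sum(axis=1, keepdims=True)
    # Divide each row by its sum; rows with zero mass are left as zeros (assumption).
    return np.divide(R, row_sums, out=np.zeros_like(R), where=row_sums > 0)

Q = transition_matrix([[0.2, 0.2],
                       [0.1, 0.5]])
print(Q)  # each row sums to one
```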

Estimating  the  densities  associated  with  the  states  of  the  Markov  chain  is  somewhat 
more  involved.  For  this  problem,  we  use  the  procedure  presented  in  Figure  4-7,  whose 
computational  structure  is  based  on  the  EM  principle.  This  procedure  must  be  applied 
separately  to  the  data  points  in  each  region  of  the  optimal  partition  to  estimate  all  L 
densities. It need not be applied directly to the data points in $\mathbb{R}^K$, however. If we wish to
build an HMM having scalar-valued output (so as to approximate a sequence of signal values
rather than state vector values), we can simply extract the first element of each state-vector
realization in a given Voronoi region and proceed to estimate a univariate output pdf based
on  these  scalar  measurements.  We  will  concentrate  on  this  univariate  case  here. 

To  estimate  a  given  output  density,  the  algorithm  must  be  supplied  with  initial  estimates 
of  all  of  the  Gaussian-mixture  parameters.  To  generate  these  initial  estimates,  we  could, 
for  example,  try  many  different  randomly  selected  sets  of  mixture  parameters  and  then 
use  the  particular  set  that  yields  the  highest  likelihood  value.  After  the  algorithm  has 
been  initialized,  a  single  iteration  then  proceeds  as  follows.  The  new  estimate  for  the 
weighting  coefficient  associated  with  the  jth  mixture  component  is  defined  to  be  the  average 
posterior  probability  that  each  observation  was  produced  by  component  j.  The  new  estimate 
for  the  mean  of  the  jth  mixture  component  is  defined  to  be  a  weighted  average  of  the 
observed  samples,  where  the  weight  placed  on  the  tth  sample  is  proportional  to  the  posterior 
probability  that  this  sample  was  generated  by  component  j.  The  new  estimate  for  the 
variance  of  the  jth  component  is  a  weighted  average  of  the  squares  of  the  observations  (after 
the  previously  computed  estimate  of  the  mean  of  the  jth  component  has  been  subtracted 
out);  the  weights  used  in  this  calculation  are  precisely  the  same  as  those  used  to  update  the 
jth  mean.  This  iterative  updating  procedure  continues  until  there  is  a  negligible  change  in 
the  tuple  of  estimated  parameter  values  from  one  iteration  to  the  next. 
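
The iteration described in this paragraph can be sketched for the univariate case as follows. This is a minimal sketch, not the thesis code: it reads "the previously computed estimate of the mean" as the mean updated earlier in the same iteration, measures convergence by the largest absolute parameter change, and floors the variance at a small constant to avoid a degenerate component. Those three choices, along with the function name, are assumptions.

```python
import numpy as np

def em_gaussian_mixture(y, mu, sigma, rho, tol=1e-8, max_iter=500):
    """Estimate means, standard deviations and weights of a univariate Gaussian mixture."""
    y = np.asarray(y, dtype=float)
    mu, sigma, rho = (np.array(a, dtype=float) for a in (mu, sigma, rho))
    for _ in range(max_iter):
        # Posterior probability that observation t came from component j.
        dens = rho * np.exp(-0.5 * ((y[:, None] - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        post = dens / dens.sum(axis=1, keepdims=True)
        weight = post.sum(axis=0)
        # Weight: average posterior; mean and variance: posterior-weighted averages.
        new_rho = weight / y.size
        new_mu = (post * y[:, None]).sum(axis=0) / weight
        new_var = (post * (y[:, None] - new_mu) ** 2).sum(axis=0) / weight
        new_sigma = np.sqrt(np.maximum(new_var, 1e-12))  # variance floor (assumption)
        change = np.max(np.abs(np.concatenate(
            [new_mu - mu, new_sigma - sigma, new_rho - rho])))
        mu, sigma, rho = new_mu, new_sigma, new_rho
        if change < tol:  # negligible change from one iteration to the next
            break
    return mu, sigma, rho

rng = np.random.default_rng(0)
samples = np.concatenate([rng.normal(-3.0, 0.5, 400), rng.normal(2.0, 1.0, 600)])
print(em_gaussian_mixture(samples, mu=[-1.0, 1.0], sigma=[1.0, 1.0], rho=[0.5, 0.5]))
```

As the text notes, the result depends on the starting point, so in practice the routine would be run from several initial parameter sets and the one with the highest likelihood retained.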
PDF ESTIMATION PROCEDURE

[0] DESCRIPTION:

Assign initial values to the means, standard deviations, and weighting coefficients of
the M-component Gaussian mixture. In addition, assign values to the N observed
samples.

OPERATION:

\[
\begin{aligned}
\mu_j &\leftarrow \mu_j^{(0)}, && j = 1, 2, \ldots, M \\
\sigma_j &\leftarrow \sigma_j^{(0)}, && j = 1, 2, \ldots, M \\
\rho_j &\leftarrow \rho_j^{(0)}, && j = 1, 2, \ldots, M \\
\Psi &\leftarrow (\mu, \sigma, \rho) \\
y_t &\leftarrow \text{observed value}, && t = 0, 1, \ldots, N-1
\end{aligned}
\]

[1] DESCRIPTION:

Using the current pdf parameter estimates, compute the posterior probability that
the tth observation came from the jth mixture component.

OPERATION:

\[ p_{tj} \leftarrow \frac{\rho_j\, \mathcal{N}(y_t; \mu_j, \sigma_j)}{\sum_{k=1}^{M} \rho_k\, \mathcal{N}(y_t; \mu_k, \sigma_k)}, \qquad t = 0, 1, \ldots, N-1;\ j = 1, 2, \ldots, M \]

[2] DESCRIPTION:

Update the pdf parameter estimates using the posterior probabilities just computed.
Define the weighting coefficient for the jth mixture component to be the arithmetic
average of all posterior probabilities associated with that component. Define the jth
mean to be a weighted average of the observed values, where the weights are the
posterior probabilities associated with the jth mixture component. Using these same
weighting terms, define the jth variance to be a weighted average of the squares of
the mean-adjusted observed values.

OPERATION:

\[
\begin{aligned}
\rho_j &\leftarrow \frac{1}{N} \sum_{t=0}^{N-1} p_{tj}, && j = 1, 2, \ldots, M \\
\mu_j &\leftarrow \frac{\sum_{t=0}^{N-1} p_{tj}\, y_t}{\sum_{t=0}^{N-1} p_{tj}}, && j = 1, 2, \ldots, M \\
\sigma_j^2 &\leftarrow \frac{\sum_{t=0}^{N-1} p_{tj}\, (y_t - \mu_j)^2}{\sum_{t=0}^{N-1} p_{tj}}, && j = 1, 2, \ldots, M \\
\Psi' &\leftarrow (\mu, \sigma, \rho)
\end{aligned}
\]

[3] DESCRIPTION:

Check if the distance between the current and previous parameter vector estimates
is below a predetermined tolerance level. If so, then terminate the algorithm. If not,
then save the current estimate, return to step 1, and perform another iteration.

OPERATION:

if $D(\Psi, \Psi') < \tau$ then
    end
else
    $\Psi \leftarrow \Psi'$
    goto [1]
endif

Figure 4-7: EM algorithm for estimating parameters of HMM output pdf.

4.4  Discussion

4.4.1  Prior  Work  in  HMM  Parameter  Identification 

A number of algorithms have been developed by other researchers for estimating the
parameters of an HMM under various assumptions, including a key assumption that we have
used here, namely that the output pdf for each state of the HMM is a Gaussian mixture.
The most popular of these existing algorithms is referred to as the Baum-Welch algorithm,
which was originally developed and analyzed in a series of papers by Baum et al. [20, 22, 23].
This technique is now recognized as an implementation of the EM algorithm; it has been
used extensively in the construction of HMM-based phonetic models for modern speech
recognition systems [78, 84, 154]. Alternative algorithms have also been developed within
the speech processing community; these include, most notably, the gradient-based
algorithm developed by Levinson et al. [109]. However, none of the existing techniques just
mentioned can be used to solve the HMM-based source identification problem as we have
defined it here, for these algorithms have been designed to optimize a likelihood-based
criterion rather than a mutual information criterion; moreover, they are not equipped to handle
the critical state-space partitioning constraint, and they therefore generally produce
HMM-based approximations which lie outside of the set of valid solutions defined in the chapter
introduction.

4.4.2  Quality  Assessment  for  the  Initial  Partition 

We  note  that  the  random  selection  approach  that  we  have  used  in  the  initialization  procedure 
described  in  Section  4.2.1  allows  us  to  assess,  in  a  probabilistic  sense,  the  quality  of  our  initial 
partition relative to all possible initial partitions. Specifically, observe that each of the
L-element subsets of $\{\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_{N-1}\}$ specifies a unique initial partition (provided that the
N original data points are distinct), and that each partition can in turn be evaluated using
its associated mutual information. In fact, these subsets can be arranged in ascending order
according to their objective function values. It is straightforward to show that a randomly
chosen subset among these ordered subsets has an objective function value in the $(100\alpha)$th
percentile with probability $1 - \alpha$, where $0 < \alpha < 1$. Furthermore, if $J_{\mathrm{init}}$ of the subsets are
chosen at random, then the subset among these with the largest objective function value
lies in the $(100\alpha)$th percentile with probability $1 - \alpha^{J_{\mathrm{init}}}$. Therefore, if we wish to know,
for example, the smallest number of initial subsets that should be tested so that the subset
with the largest objective function value is in the 95th percentile with probability 0.9 or
higher, we need only solve

\[ J_{\mathrm{init}} = \min\{J \in \{1, 2, 3, \ldots\} \mid 1 - 0.95^{J} \ge 0.9\}. \tag{4.21} \]

The above minimization yields a value of J = 45, since $0.95^{44} \approx 0.105$ while $0.95^{45} \approx 0.099$;
thus, in this case we would find the best of 45 L-element subsets chosen at random from the
N given data points. Alternatively, rather
than starting with a required exceedance probability, we might instead place a restriction
on the amount of computation we wish to perform in the initialization procedure. Such a
restriction would translate directly into a bound on $J_{\mathrm{init}}$. With this bound, we could then
assess the quality of our partition by computing performance values represented by the pair
$(100\alpha,\ 1 - \alpha^{J_{\mathrm{init}}})$ for $0 < \alpha < 1$. A similar quantitative analysis to the one just described
would also apply to the termination procedure, since this procedure also makes use of the
random subset selection technique.
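
The minimization in (4.21) is easy to check numerically. The short sketch below (with a hypothetical function name) searches for the smallest J satisfying $1 - \alpha^{J} \ge p$. Evaluated directly, the condition in (4.21) is first met at J = 45; the value 29 arises instead when the two numbers exchange roles, that is, for the 90th percentile with probability 0.95.

```python
def smallest_num_subsets(alpha, target_prob):
    """Smallest J with 1 - alpha**J >= target_prob (0 < alpha < 1, 0 < target_prob < 1)."""
    J = 1
    while 1.0 - alpha ** J < target_prob:
        J += 1
    return J

print(smallest_num_subsets(0.95, 0.90))  # 95th percentile with probability 0.9  -> 45
print(smallest_num_subsets(0.90, 0.95))  # 90th percentile with probability 0.95 -> 29
```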

4.4.3  Implicit  Constraints  Imposed  by  the  Voronoi  Assumption 

Our  decision  to  use  the  special  Voronoi  construction  for  our  state-space  partition  was  made 
in  order  to  minimize  computational  complexity  and  memory  consumption.  But  this  decision 
actually  places  a  severe  constraint  on  the  form  of  an  admissible  region  within  a  partition. 
In  particular,  any  Voronoi  region  is  inherently  a  convex  set;  hence,  as  we  will  demonstrate 
below, the region boundaries in such a partition must always be planar (i.e., portions of
K-dimensional hyperplanes). The cost incurred (in terms of sacrificed approximation quality)
as  a  result  of  using  the  Voronoi  structure,  rather  than  a  more  general  partition  structure 
which  could  model  curved  region  boundaries,  is  unknown  and  may  be  extremely  difficult  to 
measure. 

The  convexity  property  of  a  Voronoi  region  can  be  derived  directly  from  its  definition. 
To see this, suppose the point $\mathbf{x} \in \mathbb{R}^K$ is known to be an element of the Voronoi region $\mathcal{R}_j$,
so that it satisfies the distance inequalities

\[ D(\mathbf{x}, \mathbf{c}_j) \le D(\mathbf{x}, \mathbf{c}_i), \qquad i = 1, 2, \ldots, L. \tag{4.22} \]

Upon squaring both sides of each inequality, the entire set of L inequalities continues to
hold and can be expressed in the form

\[ (\mathbf{x} - \mathbf{c}_j)^T (\mathbf{x} - \mathbf{c}_j) \le (\mathbf{x} - \mathbf{c}_i)^T (\mathbf{x} - \mathbf{c}_i), \qquad i = 1, 2, \ldots, L. \tag{4.23} \]

Though at first it may appear that the above inequalities are quadratic in $\mathbf{x}$, in fact the
terms of second order cancel each other; thus, the expressions are actually linear in $\mathbf{x}$ and
can, after some algebraic manipulation, be written as

\[ \left(\mathbf{x} - \tfrac{1}{2}(\mathbf{c}_i + \mathbf{c}_j)\right)^T (\mathbf{c}_j - \mathbf{c}_i) \ge 0, \qquad i = 1, 2, \ldots, L. \tag{4.24} \]
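
For completeness, the algebra connecting the squared-distance inequalities to the linear form can be written out explicitly. Expanding both sides of the squared inequality, cancelling the common term $\mathbf{x}^T\mathbf{x}$, and collecting terms gives the following chain, in which only the standard identity $\mathbf{c}_j^T\mathbf{c}_j - \mathbf{c}_i^T\mathbf{c}_i = (\mathbf{c}_j - \mathbf{c}_i)^T(\mathbf{c}_j + \mathbf{c}_i)$ is used:

```latex
\begin{align*}
(\mathbf{x} - \mathbf{c}_j)^T(\mathbf{x} - \mathbf{c}_j) &\le (\mathbf{x} - \mathbf{c}_i)^T(\mathbf{x} - \mathbf{c}_i) \\
-2\,\mathbf{c}_j^T\mathbf{x} + \mathbf{c}_j^T\mathbf{c}_j &\le -2\,\mathbf{c}_i^T\mathbf{x} + \mathbf{c}_i^T\mathbf{c}_i \\
2\,(\mathbf{c}_j - \mathbf{c}_i)^T\mathbf{x} &\ge \mathbf{c}_j^T\mathbf{c}_j - \mathbf{c}_i^T\mathbf{c}_i
  = (\mathbf{c}_j - \mathbf{c}_i)^T(\mathbf{c}_j + \mathbf{c}_i) \\
\left(\mathbf{x} - \tfrac{1}{2}(\mathbf{c}_i + \mathbf{c}_j)\right)^T(\mathbf{c}_j - \mathbf{c}_i) &\ge 0 .
\end{align*}
```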


In Figure 4-8, we give a geometric interpretation (in two-dimensional space) of a typical
inequality from this latter set of L inequalities. Observe that the ith inequality above is
actually imposed on the inner product of two vectors, namely $\tilde{\mathbf{x}} = \mathbf{x} - \frac{1}{2}(\mathbf{c}_i + \mathbf{c}_j)$, which
is simply a re-expression of the vector $\mathbf{x}$ relative to the midpoint between $\mathbf{c}_i$ and $\mathbf{c}_j$, and
$\tilde{\mathbf{c}} = (\mathbf{c}_j - \mathbf{c}_i)/\|\mathbf{c}_j - \mathbf{c}_i\|$, which is the unit vector that points in the direction from $\mathbf{c}_i$ to $\mathbf{c}_j$.
The inequality itself implies that only those points $\mathbf{x} \in \mathbb{R}^K$ yielding an inner product that is
either positive or zero can lie in the region $\mathcal{R}_j$. In other words, each of the inequalities above,
with the exception of the trivial one in which $i = j$, can be thought of as representing a
closed half-space in K dimensions. The hyperplane forming the boundary of this half-space
is the plane that bisects the line segment connecting $\mathbf{c}_i$ and $\mathbf{c}_j$. It follows that if we take all
of the (nontrivial) inequalities simultaneously, we have a new representation of the Voronoi
region $\mathcal{R}_j$ as the intersection of $L - 1$ half-spaces; hence, the bounding surface enclosing the
Voronoi region is formed from portions of the associated $L - 1$ bounding hyperplanes. Since
the line segment connecting any two points in a closed half-space is itself contained in the
half-space, we have that a closed half-space is a convex set. Finally, since the intersection
of a finite number of convex sets is a convex set, it follows that a Voronoi region is convex.

Figure 4-8: Representation of a Voronoi region as an intersection of many half-spaces. The
inner product of $\tilde{\mathbf{x}}$ and $\tilde{\mathbf{c}}$ must be nonnegative for any point $\mathbf{x}$ in the region $\mathcal{R}_j$.
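
The equivalence between the nearest-anchor definition of a Voronoi region and its half-space representation can also be checked numerically. The sketch below assumes Euclidean distance and uses hypothetical function names; it tests membership in region j both ways for randomly drawn points.

```python
import numpy as np

def in_voronoi_region(x, anchors, j):
    """True if anchor j is (one of) the nearest anchors to x."""
    d2 = ((anchors - x) ** 2).sum(axis=1)
    return bool(np.all(d2[j] <= d2))

def in_halfspace_intersection(x, anchors, j):
    """True if x satisfies every nontrivial half-space inequality for region j."""
    for i in range(len(anchors)):
        if i == j:
            continue  # the inequality with i = j is trivial
        midpoint = 0.5 * (anchors[i] + anchors[j])
        if (x - midpoint) @ (anchors[j] - anchors[i]) < 0:
            return False
    return True

rng = np.random.default_rng(0)
anchors = rng.normal(size=(5, 2))
agree = all(
    in_voronoi_region(x, anchors, j) == in_halfspace_intersection(x, anchors, j)
    for x in rng.normal(size=(500, 2)) for j in range(5))
print(agree)
```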

4.4.4  Assumptions  on  the  HMM  State  Output  Densities 

As  part  of  the  HMM  source  identification  algorithm  described  in  this  chapter,  we  assumed 
that  the  output  pdf  associated  with  each  state  of  the  HMM  was  a  Gaussian  mixture.  Of 
course, we know that this assumption cannot be true in general, since the region of support
for a Gaussian-mixture pdf is all of $\mathbb{R}$, whereas the actual projection of a Voronoi region onto
the real line is often a bounded interval. However, in practice this violation of the
assumption does not typically cause any difficulties. In addition, as we will discover in Chapter 5,
the  Gaussian-mixture  assumption  is  convenient  not  only  because  it  can  be  specified  by  a 
small  number  of  parameters,  but  also  because  it  offers  several  practical  advantages  in  the 
signal  estimation  problem  and  other  related  problems. 

Furthermore,  we  can  use  the  EM-based  iterative  algorithm  presented  in  Section  4.3 
to  estimate  the  parameters  of  such  a  density;  this  algorithm  is  very  easy  to  implement 
and  is  reasonably  efficient.  Other  algorithms  (many  of  them  similar  to  the  one  described 
in  Figure  4-7)  for  solving  this  estimation  problem  have  been  proposed  independently  in 
several  different  contexts  and  could  also  be  used  [3,  77,  136,  191].  For  situations  in  which 
the Gaussian-mixture assumption does not provide an adequate representation of the true
signal, we can choose from a wide range of other, more general density estimation methods
(see, for example, [104, 129, 140, 159, 203, 214, 218, 223, 224] for a number of classical
approaches to density estimation, or [170, 178, 198, 199] for an overview of more modern
techniques).

4.4.5  Open  Issues  in  HMM  Parameter  Identification 

For  the  purpose  of  developing  our  source  identification  algorithm,  we  assumed  that  the 
parameters $K, L, M_1, M_2, \ldots, M_L$ were given. In practice, however, the values of these
parameters  are  not  typically  known  and  therefore  must  be  estimated.  More  precisely,  we 
must  estimate  K,  which  is  truly  an  unknown  parameter  of  the  signal,  and  we  must  select 
reasonable  values  for  the  remaining  parameters,  which  are  merely  being  used  to  approximate 
the  signal.  As  we  mentioned  in  Chapter  2,  several  methods  already  exist  for  estimating  the 
autoregressive order, K, even in cases where the signal has been generated by a nonlinear
system.  However,  there  appear  to  be  no  clear  guidelines  for  choosing  values  of  the  remaining 
model  parameters.  Undoubtedly,  the  most  critical  of  these  remaining  parameters  is  the 
HMM  order,  L,  since  this  parameter  is  the  only  one  that  affects  the  dynamical  structure 
of  the  approximation.  In  general,  the  most  appropriate  choice  for  L  will  depend  on  the 
specific  signal  processing  task  for  which  the  HMM-based  approximation  will  be  used.  A 
potentially  rich  area  for  future  research  is  to  develop  a  technique  for  optimally  selecting  the 
HMM  order,  a  priori,  based  on  a  description  of  the  signal  processing  task,  so  that  a  tedious 
process  of  trial  and  error  can  be  avoided. 


Chapter  5 

Using  Finite-State  Markov  Models 
for  Signal  Estimation 


5.1  Introduction 

In  Chapter  4,  we  developed  a  set  of  practical  numerical  techniques  for  performing  source 
identification  based  on  the  finite-state  signal  model  introduced  earlier.  However,  these 
techniques  constitute  only  a  part  of  our  overall  finite-state  signal  processing  framework  as 
defined  in  the  early  portion  of  the  thesis.  To  complete  the  framework,  we  now  turn  our 
attention  to  the  inference  problem  that  we  have  not  yet  addressed,  namely  the  problem  of 
signal  estimation.  In  the  next  two  subsections,  we  give  some  assumptions  and  notation  that 
will  be  used  in  connection  with  the  signal  estimation  problem,  and  we  provide  a  concise 
formulation  of  the  estimation  problem  itself.  In  the  third  subsection,  we  describe  how  the 
remaining  material  in  the  chapter  is  organized. 


5.1.1  Preliminary  Assumptions  and  Notation 

As  usual,  we  will  assume  that  {Yt}  is  a  stationary  signal  of  interest,  and  that  our  corrupted 
observation  of  this  signal,  {Zt},  consists  of  samples  defined  by 


Zt  =  Yt  +  Vt,  (5.1) 

where $\{V_t\}$ is a stationary noise process which is statistically independent of $\{Y_t\}$.
Throughout the chapter, we will assume that $\{Y_t\}$ is the output of an L-state HMM whose state
space is, without loss of generality, the set of integers $\{1, 2, \ldots, L\}$. The underlying Markov
chain from which $\{Y_t\}$ is generated will be denoted by $\{\Theta_t\}$; as in previous chapters, the
initial state probabilities and state transition probabilities of this Markov chain will be
denoted, respectively, by $\{P(i)\}_{i=1}^{L}$ and $\{Q(i, j)\}_{i,j=1}^{L}$. We denote the output densities of
the HMM by $\{g_i(\cdot)\}_{i=1}^{L}$. Although these output densities can in general be arbitrarily
complicated functions, in the sequel we will assume that the ith density, $g_i(\cdot)$, is a Gaussian
mixture made up of $M_i$ constituent elements, and is defined by

\[ g_i(y) = \sum_{k=1}^{M_i} \rho_{ik}\, \mathcal{N}(y; \mu_{ik}, \sigma_{ik}), \qquad -\infty < y < \infty. \tag{5.2} \]

Because  we  will  be  examining  several  different  types  of  signal  estimation  problems,  the 
definition of the additive noise process $\{V_t\}$ will be modified as appropriate as we progress
through the chapter. In all cases, however, this noise process will also be viewed as the
output of a finite-state HMM. In most of the cases considered, $\{V_t\}$ will be made up of i.i.d.
random variables; hence, the HMM representing $\{V_t\}$ will be degenerate, i.e., it will consist
of only a single state. When it becomes necessary to refer to the temporal dynamics of the
noise, we will use the notation $\{\Theta'_t\}$ to represent the underlying Markov chain from which
the noise samples are generated. We will assume that this Markov chain has $L'$ states, given
by $\{1, 2, \ldots, L'\}$, and that the initial state probabilities and state transition probabilities for
this chain are given by $\{P'(i)\}_{i=1}^{L'}$ and $\{Q'(i, j)\}_{i,j=1}^{L'}$, respectively. As in the HMM-based
representation of the signal, each of the output densities will be taken to be a Gaussian
mixture. The ith output density, which we denote by $g'_i(\cdot)$ and which is assumed to consist
of $M'_i$ Gaussian components, is defined by

\[ g'_i(v) = \sum_{k=1}^{M'_i} \rho'_{ik}\, \mathcal{N}(v; \mu'_{ik}, \sigma'_{ik}), \qquad -\infty < v < \infty. \tag{5.3} \]
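
To make the measurement model concrete, the sketch below draws a signal from an HMM with Gaussian-mixture output densities and adds noise generated by a degenerate single-state HMM, which corresponds to the i.i.d. case mentioned above. The parameter values are arbitrary illustrations, states are numbered from 0 instead of 1 for convenience, and the function name is hypothetical.

```python
import numpy as np

def sample_gmm_hmm(n, P, Q, rho, mu, sigma, rng):
    """Draw n outputs from an HMM whose state-conditional densities are Gaussian mixtures.

    P: initial state probabilities; Q: state transition matrix;
    rho[i], mu[i], sigma[i]: mixture weights, means and standard deviations of state i.
    """
    states = np.empty(n, dtype=int)
    out = np.empty(n)
    s = rng.choice(len(P), p=P)
    for t in range(n):
        states[t] = s
        k = rng.choice(len(rho[s]), p=rho[s])      # pick a mixture component
        out[t] = rng.normal(mu[s][k], sigma[s][k])  # draw from that component
        s = rng.choice(len(P), p=Q[s])              # move to the next state
    return out, states

rng = np.random.default_rng(1)
# Two-state signal model (illustrative values only).
y, theta = sample_gmm_hmm(
    200, P=[0.5, 0.5], Q=[[0.9, 0.1], [0.2, 0.8]],
    rho=[[0.3, 0.7], [1.0]], mu=[[-2.0, 0.0], [3.0]], sigma=[[0.5, 1.0], [0.7]], rng=rng)
# Single-state (degenerate) noise model: i.i.d. Gaussian-mixture noise.
v, _ = sample_gmm_hmm(
    200, P=[1.0], Q=[[1.0]], rho=[[0.9, 0.1]], mu=[[0.0, 0.0]], sigma=[[0.3, 2.0]], rng=rng)
z = y + v  # corrupted observation
```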


5.1.2  Problem  Statement  and  Approach  to  Solution 

We assume throughout the chapter that only the finite-length portion of the observed signal $\{Z_t\}$ between $t = 0$ and $t = N-1$ is available for estimating signal values of interest. When given a realization of the random vector $Z_{0:N-1}$, our goal is to attempt to determine the value that has been taken by the underlying signal vector $Y_{0:N-1}$. More precisely, our objective is to obtain, for each signal value $y_t$ ($t = 0, 1, \ldots, N-1$), the MMSE estimate $\hat{y}_t(z_{0:N-1})$ defined by

$$\hat{y}_t(z_{0:N-1}) = \arg\min_{y(\cdot)\in\mathcal{Y}} E\left\{\left(Y_t - y(z_{0:N-1})\right)^2 \,\middle|\, Z_{0:N-1} = z_{0:N-1}\right\}, \tag{5.4}$$

where $\mathcal{Y}$ represents the set of all real-valued functions of a real $N$-dimensional argument.¹
This  type  of  estimation  problem  is  commonly  referred  to  as  a  smoothing  problem,  and  its 
solution  is  typically  termed  an  optimal  smoother  [14,  164,  193,  197].  For  this  problem,  all 
parameter  values  characterizing  both  the  signal  and  the  noise  are  assumed  to  be  precisely 
known. 

As we have already pointed out in earlier chapters, the solution to (5.4) is given by the conditional expectation

$$\hat{y}_t = E\{Y_t \mid Z_{0:N-1} = z_{0:N-1}\}. \tag{5.5}$$

However, from the standpoint of generating a specific numerical estimate of the signal based on our observations, this symbolic expression is not immediately helpful. Rather, it provides only a starting point from which we may ultimately derive a more concrete solution. Much of the material in this chapter is aimed at exploiting the stochastic structure of the signal and noise outlined above in order to develop computationally efficient techniques for evaluating the right-hand side of (5.5).

¹ To simplify notation in the remainder of the chapter, we will often suppress the argument in the functional expression $\hat{y}_t(z_{0:N-1})$ and use the abbreviated symbol $\hat{y}_t$ to refer to the estimate.

5.1.3  Chapter  Organization 

The  chapter  is  organized  in  the  following  way.  First,  we  examine  the  form  of  the  above 
conditional  expectation  in  the  case  where  the  corrupting  noise  is  white  and  Gaussian,  and 
we  describe  an  efficient  recursive  algorithm  for  evaluating  the  expectation  in  this  simple 
case.  We  then  analyze  the  various  components  of  this  estimation  algorithm  to  determine  the 
approximate  number  of  arithmetic  operations  required  to  generate  the  final  signal  estimate. 
Next,  we  compare  the  estimation  performance  achieved  using  several  different  HMM-based 
representations  of  a  stationary  AR  Gaussian  signal,  so  that  we  can  quantify  the  improvement 
in  estimator  quality  as  a  function  of  the  number  of  states  in  the  HMM  and  as  a  function  of 
the  SNR.  We  then  extend  the  basic  estimation  algorithm  developed  for  the  case  in  which  the 
noise  is  white  and  Gaussian  to  more  complex  cases  in  which  the  noise  is  allowed  to  be  both 
non-Gaussian  and  non-white.  We  also  demonstrate  how  these  algorithms  can  be  configured 
to  perform  signal  separation,  i.e.,  the  estimation  of  each  of  several  statistically  independent 
non-Gaussian  signals  that  have  been  additively  combined.  Finally,  we  describe  how  the 
finite-state  modeling  paradigm  can  be  effectively  applied  to  signal  processing  problems  other 
than  smoothing,  including  the  problems  of  filtering,  prediction,  and  multi-class  hypothesis 
testing.  We  also  suggest  a  general  method  for  applying  the  finite-state  framework  to  the 
problem  of  signal  estimation  in  non-stationary  noise. 

5.2  Estimating  a  Signal  in  Additive  White  Gaussian  Noise 

5.2.1  Characterization  of  Signal,  Noise,  and  Observation 

We  begin  by  examining  the  simplest  possible  case  in  which  the  noise  process  {Vt}  is  made 
up  of  zero-mean  i.i.d.  Gaussian  random  variables.  This  constraint  on  {Vt}  implies  that  it 
can  be  represented  by  a  one-state  HMM  whose  output  pdf  is  a  one-component  Gaussian 
mixture.  In  keeping  with  the  notation  established  in  (5.3),  we  define  the  pdf  for  each  noise 
sample  Vt  by 


$$g'_1(v) = \mathcal{N}(v; 0, \sigma'_n). \tag{5.6}$$

From this definition of the additive noise process, it should be immediately evident that the observed process $\{Z_t\}$ is itself the output of an HMM whose structure is very similar to the HMM for $\{Y_t\}$. In particular, because the noise is white and therefore contributes no additional complexity to the temporal structure of the observation when it is combined with the signal, the initial state probabilities and state transition probabilities in the HMM for $\{Z_t\}$ are exactly the same as those in the HMM for $\{Y_t\}$. In fact, the only difference between these two HMMs lies in the output densities associated with the states of their respective Markov chains. Fortunately, we can easily derive the output pdf for the observed signal when its Markov chain is in state $i$. Since the signal and noise are statistically independent, this output pdf, which we denote by $h_i(\cdot)$, is obtained by convolving the signal and noise output densities $g_i(\cdot)$ and $g'_1(\cdot)$, and hence is given by

$$h_i(z) = g_i(z) * g'_1(z). \tag{5.7}$$
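Because the convolution of two Gaussian densities is again Gaussian, with the means and variances adding, the convolution in (5.7) has a simple closed form: each mixture component keeps its weight and mean, and its variance grows by the noise variance. The following Python sketch evaluates this closed form and compares it with a brute-force numerical convolution. The function names (`gauss_pdf`, `gmix_pdf`, `observation_pdf`) and the noise level are illustrative assumptions, not part of the original development; the mixture parameters are those listed for state 3 in Table 5.4.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Gaussian density N(x; mean, std), parameterized by standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def gmix_pdf(x, weights, means, stds):
    """Gaussian-mixture density of the form (5.2), evaluated at scalar or array x."""
    x = np.asarray(x, dtype=float)[..., None]      # add a trailing component axis
    return np.sum(weights * gauss_pdf(x, means, stds), axis=-1)

def observation_pdf(z, weights, means, stds, noise_std):
    """Closed form of h_i(z): every component variance grows by noise_std**2."""
    return gmix_pdf(z, weights, means, np.sqrt(stds ** 2 + noise_std ** 2))

# Mixture for state 3 of the 5-state model; the noise level is an arbitrary choice.
w = np.array([0.25, 0.28, 0.47])
mu = np.array([-0.74, -0.11, 0.45])
sd = np.array([0.43, 0.36, 0.52])
noise_std = 1.0

# Brute-force check: integrate g_i(y) * g'_1(z0 - y) over a fine grid of y values.
grid = np.linspace(-12.0, 12.0, 4801)
dy = grid[1] - grid[0]
z0 = 0.3
numeric = np.sum(gmix_pdf(grid, w, mu, sd) * gauss_pdf(z0 - grid, 0.0, noise_std)) * dy
closed = observation_pdf(z0, w, mu, sd, noise_std)
print(numeric, closed)
```

The two printed values should agree to within the accuracy of the numerical integration, which is why no explicit convolution is ever needed when evaluating $h_i(\cdot)$.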


5.2.2  Decomposition  of  the  Conditional  Signal  Mean 

Having established the stochastic structure for the signal, the noise, and the observation in this case, let us now take a closer look at the conditional expectation in (5.5), and attempt to decompose it into more manageable pieces. We begin by expressing this expectation in the form of an integral, and we then introduce further conditioning on possible values of the discrete state variable for the underlying Markov chain. This yields

\begin{align}
E\{Y_t \mid Z_{0:N-1} = z_{0:N-1}\}
&= \int_{-\infty}^{\infty} y_t\, f_{Y_t|Z_{0:N-1}}(y_t \mid Z_{0:N-1} = z_{0:N-1})\, dy_t \tag{5.9} \\
&= \int_{-\infty}^{\infty} y_t \left[ \sum_{i=1}^{L} f_{Y_t,\Theta_t|Z_{0:N-1}}(y_t, \Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}) \right] dy_t \tag{5.10} \\
&= \int_{-\infty}^{\infty} y_t \left[ \sum_{i=1}^{L} \Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\}\, f_{Y_t|\Theta_t,Z_{0:N-1}}(y_t \mid \Theta_t = i, Z_{0:N-1} = z_{0:N-1}) \right] dy_t. \tag{5.11}
\end{align}

We can now simplify this last expression somewhat by observing that

$$f_{Y_t|\Theta_t,Z_{0:N-1}}(y \mid \Theta_t = i, Z_{0:N-1} = z_{0:N-1}) = f_{Y_t|\Theta_t,Z_t}(y \mid \Theta_t = i, Z_t = z_t), \qquad -\infty < y < \infty. \tag{5.12}$$

This equality follows directly from the properties of our HMM-based signal model, for if we are given the true value of the underlying state variable $\Theta_t$, then the quantity $Y_t$ is independent of all other signal variables $\{Y_s \mid s \neq t\}$, and therefore (owing to the independence of the additive noise) is independent of all other observations $\{Z_s \mid s \neq t\}$. Indeed, when we are given the value of $\Theta_t$, the only observed signal variable that contains any information about $Y_t$ is $Z_t$ itself. If we now introduce this simplification into (5.11), and reverse the order of integration and summation, we can write

\begin{align}
E\{Y_t \mid Z_{0:N-1} = z_{0:N-1}\}
&= \sum_{i=1}^{L} \Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\} \int_{-\infty}^{\infty} y_t\, f_{Y_t|\Theta_t,Z_t}(y_t \mid \Theta_t = i, Z_t = z_t)\, dy_t \notag \\
&= \sum_{i=1}^{L} \Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\}\, E\{Y_t \mid \Theta_t = i, Z_t = z_t\}. \tag{5.13}
\end{align}

Now recall that when $\Theta_t = i$, the conditional density characterizing $Y_t$ is a Gaussian mixture made up of $M_i$ Gaussian components. Thus, we can think of the process of generating the output $Y_t$ in two separate stages, where the first stage consists of selecting at random (according to the pmf constructed from the weighting coefficients) exactly one of the $M_i$ Gaussian components from the mixture, and the second stage consists of generating a realization according to the selected Gaussian pdf. Based on this two-stage output-generation paradigm, we can introduce an additional discrete state variable to keep track of the Gaussian component that has been selected at time $t$. Let us denote this state variable by $\Phi_t$. Then the expectation appearing on the right-hand side of (5.13) can be decomposed further by conditioning on all possible outcomes for $\Phi_t$. This yields the new expression

$$E\{Y_t \mid \Theta_t = i, Z_t = z_t\} = \sum_{j=1}^{M_i} \Pr\{\Phi_t = j \mid \Theta_t = i, Z_t = z_t\}\, E\{Y_t \mid \Theta_t = i, \Phi_t = j, Z_t = z_t\}. \tag{5.14}$$

Upon combining (5.14) and (5.13), we find that the optimal estimate of $Y_t$ given $Z_{0:N-1}$ can be written as

$$\hat{y}_t = \sum_{i=1}^{L} \Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\} \cdot \sum_{j=1}^{M_i} \Pr\{\Phi_t = j \mid \Theta_t = i, Z_t = z_t\}\, E\{Y_t \mid \Phi_t = j, \Theta_t = i, Z_t = z_t\}. \tag{5.15}$$

Although  this  estimate  now  appears  to  have  a  more  complex  structure  than  it  did  originally 
in  (5.5),  it  is  nonetheless  in  a  form  that  is  much  more  amenable  to  the  development  of  an 
optimal  estimation  algorithm,  as  we  will  see  in  the  coming  sections. 

5.2.3  Analysis  of  Components  of  the  Optimal  Estimate 

Note  that  the  right-hand  side  of  (5.15)  is  made  up  of  three  basic  types  of  components:  (i) 
posterior  state  probabilities  associated  with  the  underlying  Markov  chain,  which  have  the 




Chapter  5.  Using  Finite-State  Markov  Models  for  Signal  Estimation 


form $\Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\}$; (ii) posterior sub-state probabilities associated with the Gaussian-mixture pdf for a particular state, which have the form $\Pr\{\Phi_t = j \mid \Theta_t = i, Z_t = z_t\}$; and (iii) expectations conditioned on state and sub-state outcomes, which have the form $E\{Y_t \mid \Phi_t = j, \Theta_t = i, Z_t = z_t\}$. We shall refer to these quantities as terms of type I, type II, and type III, respectively. In the remainder of this subsection, we will examine and attempt to evaluate each of these terms, beginning with terms of type III and working in reverse order until we finally consider terms of type I.

To begin, observe that if we know the events $\Theta_t = i$ and $\Phi_t = j$ have occurred, then the conditional pdf for $Y_t$ is purely Gaussian. Moreover, if we know that $Z_t = z_t$, then, although this additional knowledge may affect the parameters of the conditional pdf, the pdf itself remains Gaussian, since $Y_t$ and $Z_t$ are (conditionally) jointly Gaussian by assumption. It follows that a term of type III is linear (or, more correctly, affine) in the observed quantity $z_t$. Specifically, for such a term we can write

$$E\{Y_t \mid \Theta_t = i, \Phi_t = j, Z_t = z_t\} = \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_{ij}^2 + \sigma_n'^2}\,(z_t - \mu_{ij}). \tag{5.16}$$
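The affine form in (5.16) is an instance of the standard conditional-mean formula for jointly Gaussian random variables. As a brief worked derivation, using only the facts that $Z_t = Y_t + V_t$ and that $V_t$ is zero-mean with variance $\sigma_n'^2$ and independent of $Y_t$, and with all moments understood to be conditioned on $\Theta_t = i$ and $\Phi_t = j$:

```latex
\begin{align*}
E\{Z_t\} &= \mu_{ij}, \\
\operatorname{Cov}(Y_t, Z_t) &= \operatorname{Var}(Y_t) = \sigma_{ij}^2, \\
\operatorname{Var}(Z_t) &= \sigma_{ij}^2 + \sigma_n'^2, \\
E\{Y_t \mid Z_t = z_t\}
  &= E\{Y_t\} + \frac{\operatorname{Cov}(Y_t, Z_t)}{\operatorname{Var}(Z_t)}\,\bigl(z_t - E\{Z_t\}\bigr)
   = \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_{ij}^2 + \sigma_n'^2}\,(z_t - \mu_{ij}).
\end{align*}
```

The gain factor lies between zero and one: when the noise variance is small, the estimate follows the observation $z_t$, and when it is large, the estimate falls back toward the component mean $\mu_{ij}$.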


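Let us next consider terms of type II. Note that a term of this kind is merely the posterior probability that the Gaussian-mixture state variable $\Phi_t$ has taken the value $j$ given that the Markov chain is currently in state $i$ and the current value of our observation is $z_t$. This probability can be expressed via Bayes' rule as

\begin{align}
\Pr\{\Phi_t = j \mid \Theta_t = i, Z_t = z_t\}
&= \frac{\Pr\{\Phi_t = j \mid \Theta_t = i\}\, f_{Z_t|\Theta_t,\Phi_t}(z_t \mid \Theta_t = i, \Phi_t = j)}{f_{Z_t|\Theta_t}(z_t \mid \Theta_t = i)} \notag \\
&= \frac{\rho_{ij}\,\mathcal{N}\!\left(z_t; \mu_{ij}, \sqrt{\sigma_{ij}^2 + \sigma_n'^2}\right)}{\sum_{k=1}^{M_i} \rho_{ik}\,\mathcal{N}\!\left(z_t; \mu_{ik}, \sqrt{\sigma_{ik}^2 + \sigma_n'^2}\right)}. \tag{5.17}
\end{align}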


In  this  form,  the  posterior  sub-state  probabilities  can  now  be  easily  computed  in  terms  of 
known  model  parameters. 
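To make the preceding formulas concrete, the following Python sketch evaluates the type II terms of (5.17) and the type III terms of (5.16) for a single state, and combines them into the inner sum of (5.15). The function names and the choice of noise level are hypothetical, introduced only for illustration; the mixture parameters are those listed for state 1 in Table 5.4.

```python
import numpy as np

def gauss_pdf(x, mean, std):
    """Gaussian density N(x; mean, std), parameterized by standard deviation."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def substate_posteriors(z, weights, means, stds, noise_std):
    """Type II terms, as in (5.17): Pr{Phi_t = j | Theta_t = i, Z_t = z} for all j."""
    obs_stds = np.sqrt(stds ** 2 + noise_std ** 2)
    likelihoods = weights * gauss_pdf(z, means, obs_stds)
    return likelihoods / likelihoods.sum()          # normalize so the terms sum to one

def substate_means(z, means, stds, noise_std):
    """Type III terms, as in (5.16): E{Y_t | Theta_t = i, Phi_t = j, Z_t = z} for all j."""
    gain = stds ** 2 / (stds ** 2 + noise_std ** 2)
    return means + gain * (z - means)

def state_conditional_mean(z, weights, means, stds, noise_std):
    """Inner sum of (5.15): E{Y_t | Theta_t = i, Z_t = z}."""
    return np.dot(substate_posteriors(z, weights, means, stds, noise_std),
                  substate_means(z, means, stds, noise_std))

# Mixture for state 1 of the 5-state model; the noise level is an arbitrary choice.
w = np.array([0.10, 0.26, 0.64])
mu = np.array([-6.57, -4.96, -3.38])
sd = np.array([1.03, 0.94, 0.76])

post = substate_posteriors(-4.0, w, mu, sd, noise_std=1.0)
est = state_conditional_mean(-4.0, w, mu, sd, noise_std=1.0)
print(post, est)
```

Both kinds of terms depend only on the current observation $z_t$ and on fixed model parameters, so they can be evaluated independently at each time index.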

The only remaining quantities that must be computed in order to produce the optimal estimate are terms of type I. It turns out, however, that these terms are the most difficult of all three types to evaluate, owing to the fact that they depend on the entire observed sequence $z_{0:N-1}$, rather than only on a single observed sample, as did terms of type II and type III. A term of type I can still be calculated efficiently, but the procedure for performing this calculation is rather involved. A description of this procedure can be found in Appendix F.
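The recursions of Appendix F are not reproduced here. As a rough illustration of how the posterior state probabilities can be obtained, the following Python sketch uses the standard scaled forward-backward recursion from the HMM literature; its indexing and normalization conventions are assumptions and may differ from those of the appendix. The function name `state_posteriors`, the array layout, and the numerical values in the example are likewise hypothetical.

```python
import numpy as np

def state_posteriors(P, Q, B):
    """Return gamma[t, i] = Pr{Theta_t = i | z_0, ..., z_{N-1}}.

    P : (L,)   initial state probabilities
    Q : (L, L) transition probabilities, Q[i, j] = Pr{Theta_{t+1} = j | Theta_t = i}
    B : (N, L) output likelihoods, B[t, i] = h_i(z_t)
    """
    N, L = B.shape
    alpha = np.zeros((N, L))
    beta = np.ones((N, L))

    # Forward pass, rescaled at every step to avoid numerical underflow.
    alpha[0] = P * B[0]
    alpha[0] /= alpha[0].sum()
    for t in range(1, N):
        alpha[t] = (alpha[t - 1] @ Q) * B[t]
        alpha[t] /= alpha[t].sum()

    # Backward pass; the per-step rescaling cancels in the final normalization.
    for t in range(N - 2, -1, -1):
        beta[t] = Q @ (B[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Small two-state example with made-up likelihood values.
P = np.array([0.5, 0.5])
Q = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.8, 0.1],
              [0.7, 0.2],
              [0.1, 0.9],
              [0.3, 0.6]])
gamma = state_posteriors(P, Q, B)
print(gamma)
```

Each pass visits every time index once and performs one matrix-vector product per index, so the work grows linearly with the observation length and quadratically with the number of states.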


5.3  Analysis  of  Computation  for  HMM-Based  Estimation 

In  this  section,  our  goal  is  to  determine  the  amount  of  computation  required  to  generate 
an  estimate  with  the  formula  in  (5.15).  For  the  analysis  presented  here,  we  assume  for 
convenience  that  the  Gaussian-mixture  pdf  assigned  to  each  state  has  a  fixed  number  of 
components, say $M$, rather than a state-dependent number. We wish to derive an expression for the total computational cost as a function of the model parameters $M$ and $L$, and as a function of the number of samples $N$ in the given realization $z_{0:N-1}$. To accomplish this,
we  first  determine  the  cost  incurred  in  producing  individual  terms  of  types  I,  II,  and  III, 
and  we  then  quantify  the  amount  of  computation  required  to  combine  these  terms  when 
forming  the  final  estimate.  Throughout  the  following  analysis,  we  assume  that  the  basic 
arithmetic  operations  of  addition,  subtraction,  multiplication,  and  division  all  require  the 
same  number  of  primitive  computer  instructions;  thus,  any  such  operation  will  be  assigned 
a  single  unit  of  computational  cost. 

It turns out that the components of the estimation formula above are listed in order of decreasing complexity. Thus, let us consider these components in reverse order, beginning with terms of type III. Recall that any term of type III has the form

$$E\{Y_t \mid \Theta_t = i, \Phi_t = j, Z_t = z_t\} = \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_{ij}^2 + \sigma_n'^2}\,(z_t - \mu_{ij}). \tag{5.18}$$

Although this expression contains certain quantities that could, in order to reduce computational expense, be computed and stored in advance for use during the algorithm, its most important attribute from a computational standpoint is that it requires a fixed number of arithmetic operations (let us say $J$ operations altogether), independent of the model parameters $L$ and $M$.

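A term of type II has the form

$$\Pr\{\Phi_t = j \mid \Theta_t = i, Z_t = z_t\} = \frac{\rho_{ij}\,\mathcal{N}\!\left(z_t; \mu_{ij}, \sqrt{\sigma_{ij}^2 + \sigma_n'^2}\right)}{\sum_{k=1}^{M} \rho_{ik}\,\mathcal{N}\!\left(z_t; \mu_{ik}, \sqrt{\sigma_{ik}^2 + \sigma_n'^2}\right)}. \tag{5.19}$$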


These posterior probabilities, which are derived using Bayes' rule, can be constructed during the estimation algorithm by normalizing a collection of $M$ weighted Gaussian pdf values so that they sum to unity. If we assume that each evaluation of a Gaussian pdf consumes $G$ arithmetic operations, then computing all $M$ of the unnormalized likelihood values would take $M(G+1)$ operations ($G$ for each pdf evaluation plus one multiplication by the weighting coefficient). Now suppose that, once these values have been computed, they can be stored in memory temporarily until all of them can be appropriately scaled. Computing the normalizing denominator in the above expression then requires $M-1$ additions, and the subsequent scaling of the original values requires $M$ divisions, bringing the total number of operations performed to $M(G+3)-1$. However, we can view these operations as being distributed over all $M$ terms; hence, in order to evaluate a single term of type II, we need essentially $G+3$ operations.


Let us finally determine the amount of computation needed to evaluate terms of type I. Such terms are computed during the estimation algorithm using special recursive formulas (derived in Appendix F), and are therefore inherently more complicated to analyze. A term of type I is given by

$$\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{L} \alpha_t(j)\,\beta_t(j)}, \tag{5.20}$$

where $\alpha_t(\cdot)$ and $\beta_t(\cdot)$ are the recursively computed forward and backward variables. The above expression represents the normalization of $L$ likelihood values, one for each state of the underlying Markov chain. By the same reasoning we used earlier, we see that evaluating all $L$ of the probabilities requires $L$ initial multiplications to obtain the unnormalized likelihood values, then $L-1$ additions to compute the scaling term, and finally $L$ divisions to perform the normalization, for a total of $3L-1$ operations. Because these operations are distributed over the $L$ terms, each term requires essentially 3 operations after $\alpha_t(\cdot)$ and $\beta_t(\cdot)$ have been computed.

But we must now turn our attention to the evaluation of these forward and backward variables. First, we observe (see Appendix F for details) that the expression for the particular value $\alpha_{t+1}(i)$ is given by

$$\alpha_{t+1}(i) = \left[\sum_{j=1}^{L} Q(i,j)\,\alpha_t(j)\right] h_i(z_t). \tag{5.21}$$

Since each term in the bracketed summation requires one multiplication, and there are $L$ terms in all, the sum itself requires $L$ multiplications and $L-1$ additions, for a total of $2L-1$ operations. After the sum has been computed, it is multiplied by an $M$-component Gaussian-mixture pdf value, which consumes $GM$ operations (plus one operation for the subsequent multiplication). Thus, to evaluate the single quantity $\alpha_t(i)$, we need $2L + GM$ operations.²

A somewhat different result holds when computing the quantity $\beta_t(i)$, which is given by

$$\beta_t(i) = \sum_{j=1}^{L} Q(i,j)\,h_j(z_{t+1})\,\beta_{t+1}(j). \tag{5.22}$$

Evaluating the pdf again takes $GM$ operations, but this pdf value is multiplied by two other numbers, yielding a total of $GM+2$ operations for each term in the summation. Since there are $L$ terms in all, the sum requires $(GM+2)L$ operations to form the individual terms and $L-1$ additions, for a grand total of $(GM+2)L + (L-1)$ operations. Combining the computational requirements for $\alpha_t(i)$, $\beta_t(i)$, and $\gamma_t(i)$, we see that $(GM+5)L + GM + 2$ operations are needed to evaluate a term of type I.

Now let us begin putting the above pieces together to determine the computational requirement for the original estimation formula in (5.15). To begin, we note that each term in the inner summation of the estimation formula requires the multiplication of a term of type II and a term of type III. Thus, each term takes $J + G + 4$ operations to evaluate. Since there are $M$ such terms to be computed for a fixed value of $i$ in the outer summation, the inner sum takes $(J + G + 5)M - 1$ operations to evaluate, including the $M-1$ additions needed to form the sum. After the inner sum has been computed, it is multiplied by a term of type I, and thus a single term in the outer summation consumes $(GM+5)L + (J + 2G + 5)M + 2$ operations. Finally, since the outer summation consists of $L$ such terms, the overall number of operations required (including the $L-1$ final additions) is $(GM+5)L^2 + [(J + 2G + 5)M + 3]L - 1$. In later sections, we shall assume that the first term of this expression is the dominant term, so that a reasonable first-order approximation to computational cost is $cML^2$ operations per sample, where $c$ is an appropriately chosen constant. Thus, assuming all other parameters are held fixed, we see that total computation is linear in the number of Gaussian components in each output pdf, $M$, and is quadratic in the number of states in the underlying Markov chain, $L$.

² We have implicitly assumed here that all values of the forward variable (and, for that matter, the backward variable as well) are computed only once and then stored in computer memory for the duration of the HMM-based estimation algorithm, so that they can later be accessed on demand at no expense. If this were not the case, the computational cost would clearly be much greater than the amount stated here.

Of  course,  the  expression  we  have  just  derived  represents  the  computational  requirement 
only  for  a  single  time  index  t.  To  obtain  the  number  of  operations  needed  to  evaluate  the 
entire  waveform  estimate,  we  multiply  this  number  by  N ,  the  number  of  samples  in  the 
observation.  Thus,  the  amount  of  computation  required  to  generate  a  waveform  estimate 
depends  linearly  on  N  when  the  model  parameters  L  and  M  are  held  constant. 
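The bookkeeping in this section can be checked mechanically. The following Python sketch accumulates the operation counts step by step, exactly as they are tallied above, and compares the result with the closed-form expression $(GM+5)L^2 + [(J+2G+5)M+3]L - 1$. The function names and the particular values chosen for $G$, $J$, and $N$ are illustrative assumptions only.

```python
def ops_per_sample(L, M, G, J):
    """Accumulate the per-sample operation count following the analysis step by step."""
    type3 = J                               # one affine estimate, as in (5.18)
    type2 = G + 3                           # one sub-state posterior, as in (5.19)
    alpha = 2 * L + G * M                   # one forward variable
    beta = (G * M + 2) * L + (L - 1)        # one backward variable
    type1 = alpha + beta + 3                # one state posterior, as in (5.20)
    inner_term = type2 + type3 + 1          # product of a type II and a type III term
    inner_sum = M * inner_term + (M - 1)    # M products plus M - 1 additions
    outer_term = type1 + inner_sum + 1      # inner sum multiplied by a type I term
    return L * outer_term + (L - 1)         # L outer terms plus L - 1 additions

def closed_form(L, M, G, J):
    """Closed-form per-sample count stated in the text."""
    return (G * M + 5) * L ** 2 + ((J + 2 * G + 5) * M + 3) * L - 1

# Hypothetical costs: G = 10 operations per Gaussian pdf, J = 5 per type III term.
G, J, N = 10, 5, 1000
for L, M in [(1, 1), (2, 3), (5, 3), (9, 3)]:
    per_sample = ops_per_sample(L, M, G, J)
    print(L, M, per_sample, closed_form(L, M, G, J), N * per_sample)
```

For fixed $M$, the per-sample count grows roughly as $L^2$, while the last column, the cost of the entire waveform estimate, scales linearly with $N$.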

5.4  HMM-Based  Performance  Evaluation  for  the  Gaussian 
Problem 

Thus  far  in  this  chapter,  we  have  discussed  only  the  methods  involved  in  HMM-based 
estimation.  In  later  sections,  we  will  also  discuss  certain  extensions  of  these  methods  so 
that  they  can  be  applied  in  more  complex  signal  processing  problems.  But  for  the  moment, 
let  us  shift  our  focus  away  from  the  details  of  algorithm  derivation,  and  instead  consider  how 
our  HMM-based  procedure  performs  in  a  specific  signal  estimation  problem.  We  devote  this 
section  to  a  discussion  of  an  experiment  which  uses  computer-simulated  signals  and  noise 
to  determine  how  the  performance  of  our  new  HMM-based  method  changes  as  a  function 
of  two  major  parameters:  (i)  the  number  of  states  in  the  finite-state  signal  model;  and  (ii) 
the  signal-to-noise  ratio  (SNR)  characterizing  the  observation. 

The experiment is designed to address a simple, purely Gaussian signal estimation problem, so that the globally optimum processor is known exactly and can be implemented with ease. In particular, the true source signal $\{Y_t\}$ for this problem is assumed to obey the second-order difference equation

$$Y_t = 0.75\,Y_{t-1} + 0.2\,Y_{t-2} + W_t, \tag{5.23}$$

where the sequence $\{W_t\}$ consists of i.i.d. Gaussian random variables, each having a mean of zero and a standard deviation of unity. The additive noise sequence $\{V_t\}$, on the other hand, is assumed to consist of i.i.d. Gaussian random variables, each having a mean of zero and a standard deviation $\sigma'_n$, whose value is specified according to the SNR level being tested.

  i    P(i)    Q(i,j)
  1    1.00    1.00

  i    μ(i)    σ(i)    ρ(i)
  1    0.00    2.93    1.00

Table 5.1: Parameter definitions for 1-state HMM representation of the AR Gaussian process $Y_t = 0.75\,Y_{t-1} + 0.2\,Y_{t-2} + W_t$. From top to bottom: specification of initial state probability and state transition probability for Markov chain; specification of the mean, standard deviation, and weighting coefficient for Gaussian-mixture pdf associated with the single state.

We will examine the signal estimation performance achieved by using each of five distinct HMM-based representations of $\{Y_t\}$, specifically representations containing one, two, three, five, and nine states. The parameter values for each signal model used in the experiment are generated directly from realizations of $\{Y_t\}$ using the model-building algorithms described in Chapter 4. Because the true signal $\{Y_t\}$ can be characterized with a state-space representation in which the state vector is two-dimensional, the states in each HMM-based representation of $\{Y_t\}$ are made to correspond to disjoint regions in the two-dimensional coordinate plane. The output pdf assigned to each state of each HMM (with the exception of the one-state HMM) is a Gaussian mixture having three components. For the one-state HMM, the output pdf is taken to be the Gaussian marginal pdf of the true signal $\{Y_t\}$. The parameter values for all five of the finite-state signal models are given in Tables 5.1 through 5.5.
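As a quick check on the one-state model, the marginal standard deviation of the AR process in (5.23) can be obtained from the Yule-Walker equations and compared with the value 2.93 listed in Table 5.1. The following Python sketch performs this calculation and also generates a realization of the process; the function name `simulate_ar2`, the burn-in length, and the random seed are arbitrary assumptions made for illustration.

```python
import numpy as np

a1, a2 = 0.75, 0.2          # AR coefficients from (5.23); the driving noise has unit variance

# Yule-Walker equations for a stationary AR(2) process.
rho1 = a1 / (1.0 - a2)                    # lag-1 correlation coefficient
rho2 = a1 * rho1 + a2                     # lag-2 correlation coefficient
var_y = 1.0 / (1.0 - a1 * rho1 - a2 * rho2)
std_y = np.sqrt(var_y)                    # approximately 2.93

# Covariance matrix of the two-dimensional state vector [Y_t, Y_{t-1}].
C_state = var_y * np.array([[1.0, rho1],
                            [rho1, 1.0]])

def simulate_ar2(n, rng, burn_in=2000):
    """Generate n samples of the AR(2) process after discarding a start-up transient."""
    w = rng.standard_normal(n + burn_in)
    y = np.zeros(n + burn_in)
    for t in range(2, n + burn_in):
        y[t] = a1 * y[t - 1] + a2 * y[t - 2] + w[t]
    return y[burn_in:]

y = simulate_ar2(200_000, np.random.default_rng(0))
print(std_y, y.std())
```

The sample standard deviation of a long realization should be close to the analytical value, although it converges slowly because successive samples of this process are strongly correlated.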


Before we begin to assess the estimation performance associated with each of the finite-state models defined above, let us first try to gain an appreciation for the differences among these models by examining the statistical structure of their output signals. A simple, qualitative way of doing this is to generate a suitably long realization of the output signal from each HMM, and then visually compare and contrast the temporal patterns that are present in the resulting collection of realizations. In Figures 5-1(a) through 5-1(e), we show plots of output waveforms generated by each of the finite-state models used in the
experiment.  This  series  of  plots  is  ordered  from  top  to  bottom  according  to  the  number 
of  states  in  the  signal  model.  Note  that  each  successive  waveform  in  the  series  possesses 
the  same  basic  shape  as  its  predecessor,  but  also  contains  a  significant  amount  of  detail 
that  was  not  present  before.  The  similarity  among  these  waveforms  results  from  the  fact 
that  the  underlying  sequence  of  state  variable  values  for  each  waveform  was  generated  from 


Table 5.2: Parameter definitions for 2-state HMM representation of the AR Gaussian process $Y_t = 0.75\,Y_{t-1} + 0.2\,Y_{t-2} + W_t$. From top to bottom: partitioning of underlying two-dimensional state space (superimposed on an elliptical equi-probability contour of the true state-vector pdf); specification of initial state probabilities and state transition probabilities for Markov chain; specification of means, standard deviations, and weighting coefficients for Gaussian-mixture pdf associated with each state.
Table 5.3: Parameter definitions for 3-state HMM representation of the AR Gaussian process $Y_t = 0.75\,Y_{t-1} + 0.2\,Y_{t-2} + W_t$. From top to bottom: partitioning of underlying two-dimensional state space (superimposed on an elliptical equi-probability contour of the true state-vector pdf); specification of initial state probabilities and state transition probabilities for Markov chain; specification of means, standard deviations, and weighting coefficients for Gaussian-mixture pdf associated with each state.
  i   P(i)    Q(i,j)
  1   0.16    0.86   0.14   0      0      0
  2   0.23    0.10   0.74   0.16   0      0
  3   0.22    0      0.17   0.66   0.17   0
  4   0.23    0      0      0.16   0.74   0.10
  5   0.16    0      0      0      0.14   0.86

  i   μ(i)                      σ(i)                   ρ(i)
  1   -6.57   -4.96   -3.38     1.03   0.94   0.76     0.10   0.26   0.64
  2   -2.12   -1.38   -0.67     0.63   0.50   0.50     0.43   0.39   0.18
  3   -0.74   -0.11    0.45     0.43   0.36   0.52     0.25   0.28   0.47
  4    0.67    1.38    2.12     0.50   0.50   0.63     0.18   0.39   0.43
  5    3.38    4.96    6.57     0.76   0.94   1.03     0.64   0.26   0.10

Table 5.4: Parameter definitions for 5-state HMM representation of the AR Gaussian process $Y_t = 0.75\,Y_{t-1} + 0.2\,Y_{t-2} + W_t$. From top to bottom: partitioning of underlying two-dimensional state space (superimposed on an elliptical equi-probability contour of the true state-vector pdf); specification of initial state probabilities and state transition probabilities for Markov chain; specification of means, standard deviations, and weighting coefficients for Gaussian-mixture pdf associated with each state.
  i   P(i)     Q(i,j)
  1   0.074    0.85   0.15   0      0      0      0      0      0      0
  2   0.111    0.10   0.69   0.20   0.01   0      0      0      0      0
  3   0.126    0      0.18   0.59   0.22   0.01   0      0      0      0
  4   0.125    0      0      0.22   0.52   0.24   0.02   0      0      0
  5   0.128    0      0      0.02   0.23   0.50   0.23   0.02   0      0
  6   0.125    0      0      0      0.02   0.24   0.52   0.22   0      0
  7   0.126    0      0      0      0      0.01   0.22   0.59   0.18   0
  8   0.111    0      0      0      0      0      0.01   0.20   0.69   0.10
  9   0.074    0      0      0      0      0      0      0      0.15   0.85

  i   μ(i)                      σ(i)                   ρ(i)
  1    …      -6.04   -4.86     1.13   0.58   0.56     0.27   0.27   0.46
  2    …       …      -2.86     0.49   0.49   0.53     0.25   0.41   0.34
  3   -2.70   -2.40   -1.75     0.47   0.42   0.50     0.12   0.27   0.61
  4   -1.57   -1.18   -0.58     0.46   0.35   0.44     0.24   0.27   0.49
  5   -0.74   -0.27    0.28     0.37   0.41   0.48     0.11   0.29   0.60
  6    0.58    1.18    1.57     0.44   0.35   0.46     0.49   0.27   0.24
  7    1.75    2.40    2.70     0.50   0.42   0.47     0.61   0.27   0.12
  8    2.86    3.50    4.18     0.53   0.49   0.49     0.34   0.41   0.25
  9    4.86    6.04    7.10     0.56   0.58   1.13     0.46   0.27   0.27

Table 5.5: Parameter definitions for 9-state HMM representation of the AR Gaussian process $Y_t = 0.75\,Y_{t-1} + 0.2\,Y_{t-2} + W_t$. From top to bottom: partitioning of underlying two-dimensional state space (superimposed on an elliptical equi-probability contour of the true state-vector pdf); specification of initial state probabilities and state transition probabilities for Markov chain; specification of means, standard deviations, and weighting coefficients for Gaussian-mixture pdf associated with each state.
Figure 5-1: Waveforms generated by increasingly accurate HMM representations of the AR Gaussian process $Y_t = 0.75\,Y_{t-1} + 0.2\,Y_{t-2} + W_t$: (a) output of one-state HMM; (b) output of two-state HMM; (c) output of three-state HMM; (d) output of five-state HMM; (e) output of nine-state HMM.
a common pseudo-random noise sequence. The progressive increase in signal detail is a natural consequence of the corresponding increase in model complexity.

The  waveform  produced  by  the  one-state  HMM,  shown  in  Figure  5-1  (a),  is  the  coarsest 
possible  finite-state  representation  of  the  actual  signal,  because  any  HMM  with  only  a  single 
state  is  capable  of  modeling  only  the  marginal  statistics  of  the  signal.  (Such  an  HMM 
cannot  model  any  temporal  dependence  exhibited  by  the  original  signal,  since  the  output 
samples  of  a  one-state  HMM  are  necessarily  statistically  independent.)  On  the  other  hand, 
the  waveform  produced  by  the  two-state  HMM  exhibits  some  temporal  correlation,  but 
the  representation  is  coarse,  abruptly  switching  back  and  forth  between  two  gross  output 
levels over time. As we continue to scan through this series of plots, we observe a greater and greater degree of refinement in signal structure, until finally we come to the waveform produced by the nine-state HMM. This waveform, which is shown in Figure 5-1(e), is almost indistinguishable in character from a waveform that would be generated by the original autoregressive linear model in (5.23).

Although  it  is  both  interesting  and  useful  to  examine  realizations  of  the  output  signals 
of  various  finite-state  models,  as  we  have  done  with  Figure  5-1,  this  exercise  does  not 
necessarily  help  us  to  predict  the  estimation  performance  that  will  be  achieved  by  each 
HMM.  In  the  remaining  portion  of  this  section,  we  will  quantify  the  performance  associated 
with  each  model  and  compare  its  performance  to  that  of  the  optimal  Wiener  smoother.  For 
the  particular  estimation  problem  we  have  chosen,  where  the  observation  is  a  Gaussian  signal 
combined  with  independent  white  Gaussian  noise,  a  reasonable  measure  of  performance  is 
the  gain  in  signal-to-noise  ratio  (SNR)  achieved  by  the  estimator.  Of  course,  in  order  to 
calculate  this  gain,  we  must  have  suitable  definitions  for  both  the  input  SNR  and  the  output 
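SNR, which we denote, respectively, by $\mathrm{SNR}_{\mathrm{in}}$ and $\mathrm{SNR}_{\mathrm{out}}$. For the first of these quantities,
$\mathrm{SNR}_{\mathrm{in}}$, we will use the classical definition given by

$$\mathrm{SNR}_{\mathrm{in}} = 10 \log_{10} \frac{E\{Y_t^2\}}{E\{V_t^2\}}. \tag{5.24}$$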

For the remaining quantity, $\mathrm{SNR}_{\mathrm{out}}$, the definition is not as straightforward, for it requires
that we view the output estimate $\hat{Y}_t$ as being composed of the original signal value $Y_t$
together with an additive noise component $U_t$, which actually represents the estimation
error. In other words, we need to express $\hat{Y}_t$ in the form

\begin{align}
\hat{Y}_t &= Y_t + U_t \tag{5.25} \\
&= Y_t + (\hat{Y}_t - Y_t). \tag{5.26}
\end{align}

With this construction, we can define the output SNR as3

$$\mathrm{SNR}_{\mathrm{out}} = 10 \log_{10} \frac{E\{Y_t^2\}}{E\{(\hat{Y}_t - Y_t)^2\}}. \tag{5.27}$$


3 Because the term $(\hat{Y}_t - Y_t)$ is more naturally viewed as an error term than as an additive noise term,
the quantity $\mathrm{SNR}_{\mathrm{out}}$ is also commonly referred to as the signal-to-error ratio, or SER.
In all cases, the value of $\mathrm{SNR}_{\mathrm{in}}$ can be expressed in closed form in terms of the signal
covariance matrix $C_Y$ and the noise variance $\sigma_V^2$ that is needed to achieve the desired SNR
level in a specific case. The input SNR is given by

$$\mathrm{SNR}_{\mathrm{in}} = 10 \log_{10} \frac{\mathrm{tr}(C_Y)}{N \sigma_V^2}, \tag{5.28}$$

where $\mathrm{tr}(\cdot)$ represents the matrix trace operator and $N$ is the observation length. Furthermore,
whenever the Wiener smoother is applied to the observation, the quantity $\mathrm{SNR}_{\mathrm{out}}$
can also be expressed in closed form. This quantity is given by

$$\mathrm{SNR}_{\mathrm{out}} = 10 \log_{10} \frac{\mathrm{tr}(C_Y)}{\mathrm{tr}\left(C_Y - C_Y (C_Y + \sigma_V^2 I)^{-1} C_Y\right)}. \tag{5.29}$$

Whenever  one  of  the  HMM-based  smoothers  is  applied,  however,  we  must  resort  to  an 
estimate for the denominator in the expression for $\mathrm{SNR}_{\mathrm{out}}$. This estimate is obtained
simply  by  taking  the  arithmetic  mean  of  the  squared  values  of  the  actual  error  waveform. 
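As a rough illustration of how these two quantities might be evaluated, the following Python sketch solves (5.28) for the noise variance at a target input SNR and forms the sample-based output SNR described above. The function names, and the choice to pass the per-sample signal power $\mathrm{tr}(C_Y)/N$ as an argument, are assumptions made for this example only.

```python
import math

def noise_variance_for_snr(signal_power, snr_in_db):
    """Solve (5.28) for the noise variance, with signal_power = tr(C_Y) / N."""
    return signal_power / (10.0 ** (snr_in_db / 10.0))

def empirical_snr_out_db(signal_power, original, estimate):
    """Approximate SNR_out by replacing the expected squared error with the
    arithmetic mean of the squared values of the actual error waveform."""
    squared_errors = [(e - y) ** 2 for y, e in zip(original, estimate)]
    mean_squared_error = sum(squared_errors) / len(squared_errors)
    return 10.0 * math.log10(signal_power / mean_squared_error)

# Hypothetical numbers: a signal power of 4.0 observed at -10 dB input SNR.
sigma_v2 = noise_variance_for_snr(4.0, -10.0)
```

At low input SNR the required noise variance is large relative to the signal power; in this hypothetical case it is ten times larger.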

For  our  experiment,  estimation  performance  was  measured  for  each  signal  model  at 
input  SNR  levels  of  -10  dB,  -5  dB,  0  dB,  5  dB,  and  10  dB.  For  each  input  SNR  level, 
a  total  of  1000  waveforms  were  processed  by  each  of  the  finite-state  estimators  described 
above.  In  addition,  each  waveform  was  processed  by  the  Wiener  smoother,  whose  output  is 
given  by 


$$\hat{y}_{0:N-1} = C_Y (C_Y + \sigma_V^2 I)^{-1} z_{0:N-1}. \tag{5.30}$$

Each  input  waveform  was  300  samples  in  length.  In  Figure  5-2,  we  show  the  results  of 
a  single  experimental  trial  for  which  the  input  SNR  level  was  0  dB.  The  collection  of 
waveforms  plotted  in  this  figure  includes  realizations  of  the  original  source  signal,  the  noisy 
observed  signal,  and  the  estimates  generated  by  each  HMM-based  smoother  and  by  the 
Wiener  smoother.  Each  of  these  waveforms  is  shown  next  to  its  associated  residual,  which 
was  created  by  subtracting  the  original  from  the  estimate. 
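The closed-form Wiener quantities in (5.29) and (5.30) can be sketched in a few lines of Python. The example below is only illustrative: it uses a short observation length (30 rather than 300 samples) so that a plain Gauss-Jordan inverse suffices, and it assumes a unit-variance driving noise $W_t$, which is not restated in this section. All function names are hypothetical.

```python
import math

def ar2_autocovariance(a1, a2, drive_var, n):
    """Autocovariance gamma[0..n-1] of the stationary AR(2) process
    Y_t = a1*Y_{t-1} + a2*Y_{t-2} + W_t, via the Yule-Walker equations."""
    rho = [1.0, a1 / (1.0 - a2)]
    for k in range(2, max(n, 3)):
        rho.append(a1 * rho[k - 1] + a2 * rho[k - 2])
    gamma0 = drive_var / (1.0 - a1 * rho[1] - a2 * rho[2])
    return [gamma0 * r for r in rho[:n]]

def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def wiener_gain(cov, noise_var):
    """Gain matrix C_Y (C_Y + sigma_V^2 I)^{-1} appearing in (5.30)."""
    n = len(cov)
    noisy = [[cov[i][j] + (noise_var if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    inv = invert(noisy)
    return [[sum(cov[i][k] * inv[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def wiener_smooth(gain, z):
    """Apply the smoother to an observation vector z, as in (5.30)."""
    return [sum(g * zk for g, zk in zip(row, z)) for row in gain]

def wiener_snr_out_db(cov, noise_var):
    """Closed-form output SNR of the Wiener smoother, as in (5.29)."""
    n = len(cov)
    gain = wiener_gain(cov, noise_var)
    signal = sum(cov[i][i] for i in range(n))
    error = sum(cov[i][i] - sum(gain[i][k] * cov[k][i] for k in range(n))
                for i in range(n))
    return 10.0 * math.log10(signal / error)

N = 30                                   # shortened from 300 to keep the sketch fast
gamma = ar2_autocovariance(0.75, 0.2, 1.0, N)
C = [[gamma[abs(i - j)] for j in range(N)] for i in range(N)]
sigma_v2 = gamma[0]                      # noise variance giving 0 dB input SNR via (5.28)
snr_out = wiener_snr_out_db(C, sigma_v2)
```

Because the smoother exploits the temporal correlation of the signal, the value of `snr_out` exceeds what any memoryless linear estimator could achieve at the same input SNR.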

In  Figures  5-3(a)  and  5-3(b),  we  give  a  graphical  summary  of  the  performance  of  the 
HMM-based  smoothers,  as  well  as  that  of  the  globally  optimal  Wiener  smoother.  Let  us 
now  consider  each  of  these  plots  in  turn.  Observe  that  a  single  curve  on  the  plot  shown 
in  Figure  5-3(a)  indicates  the  output  SNR  that  was  achieved  by  a  particular  finite-state 
estimation  algorithm  as  a  function  of  the  input  SNR.4  It  is  clear  from  this  figure  that  the 
nine-state  HMM  performs  nearly  as  well  as  the  optimal  Wiener  smoother.  (Note  that  we 


4 To interpret these curves properly, we must be aware that the plotted values for the output SNR are
somewhat  deceiving,  for  in  certain  cases  they  seem  to  suggest  that  extraordinary  estimation  gains  have  been 
made,  especially  at  low  input  SNR  levels.  For  example,  it  appears  that  the  one-state  HMM  achieves  an 
estimation  gain  of  more  than  10  dB  when  the  input  SNR  is  -10  dB.  To  understand  why  this  is  true,  we  must 
keep  in  mind  that  an  optimal  estimator  will  produce  a  value  close  to  the  prior  mean  of  the  underlying  signal 
whenever  the  input  SNR  is  extremely  small,  since  very  little  new  information  can  be  extracted  from  the 
observation  itself.  In  this  case,  since  our  one-state  HMM-based  estimator  is  nothing  more  than  an  optimal 
memoryless  Wiener  smoother  (i.e.,  a  one-sample  FIR  Wiener  filter),  and  since  the  underlying  signal  has  a 
prior  mean  of  zero,  the  estimator  tends  to  produce  values  very  near  zero.  Of  course,  this  produces  a  residual 
Figure 5-2: Plots of estimated versions of the original waveform (left-hand column) and their
corresponding residual waveforms (right-hand column) after subtracting out the original:
(a) original waveform; (b) original waveform combined with additive noise (0 dB SNR); (c)
estimate of original waveform produced by one-state HMM, and its residual; (d) estimate
and residual produced by two-state HMM. (Continued on following page.)


Chapter  5.  Using  Finite-State  Markov  Models  for  Signal  Estimation 




Figure 5-2: (Continued from previous page.) (e) estimate and residual produced by three-state
HMM; (f) estimate and residual produced by five-state HMM; (g) estimate and residual
produced by nine-state HMM; (h) estimate and residual produced by optimal Wiener
smoother.
have labeled the performance curve associated with the Wiener smoother as “inf-state”
to indicate that this curve could also be achieved by using an HMM with an infinite number
of  states.)  In  addition,  we  see  that  the  performance  of  the  five-state  HMM  is  within  1  dB  of 
the  performance  of  the  Wiener  smoother  over  the  entire  range  of  input  SNR  levels  tested. 
Thus,  we  have  the  rather  surprising  conclusion  that,  for  this  purely  Gaussian  problem, 
near-optimal  performance  can  be  achieved  by  using  only  a  very  coarse  (albeit  well-designed) 
finite-state  signal  model.5  As  one  might  predict,  however,  performance  can  degrade  rapidly 
if  the  model  becomes  too  coarse.  This  conclusion  can  be  drawn  from  the  performance  curves 
associated  with  the  three-state,  two-state,  and  one-state  models.  We  note  in  particular  that 
the  estimator  based  on  the  one-state  HMM  (which  is,  as  we  have  already  mentioned,  the 
best  possible  memoryless  operation  that  can  be  performed  on  the  observed  signal),  yields 
only  a  slight  improvement  in  SNR,  even  at  the  highest  input  SNR  level  tested;  in  fact,  for 
this  case,  its  performance  lags  behind  that  of  the  Wiener  smoother  by  approximately  3  dB. 
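The behavior of the one-state estimator can be checked with a short calculation. For a zero-mean Gaussian signal in independent zero-mean Gaussian noise, the best memoryless estimate scales each observation by $\sigma_Y^2 / (\sigma_Y^2 + \sigma_V^2)$, which gives an output SNR of $10 \log_{10}(1 + \sigma_Y^2/\sigma_V^2)$. The Python sketch below tabulates this expression at the input SNR levels used in the experiment; the function name is hypothetical, and the calculation assumes the idealized scalar estimator rather than the trained one-state HMM itself.

```python
import math

def memoryless_snr_out_db(snr_in_db):
    """Output SNR of the optimal one-sample linear estimator for a zero-mean
    Gaussian signal in independent zero-mean Gaussian noise."""
    ratio = 10.0 ** (snr_in_db / 10.0)      # sigma_Y^2 / sigma_V^2
    return 10.0 * math.log10(1.0 + ratio)   # signal power / minimum mean-squared error

# Input SNR levels used in the experiment described above.
table = {snr_in: memoryless_snr_out_db(snr_in) for snr_in in (-10, -5, 0, 5, 10)}
```

Under these assumptions the output SNR is roughly 0.4 dB at an input SNR of -10 dB and roughly 10.4 dB at 10 dB, so the apparent gain is large at low input SNR and slight at high input SNR.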

Figure  5-3(b)  displays  exactly  the  same  data  shown  in  Figure  5-3(a),  but  in  a  slightly 
different  format.  Specifically,  a  single  curve  on  this  plot  now  indicates  the  output  SNR 
that  was  achieved  at  a  fixed  input  SNR  level  as  a  function  of  the  number  of  states  used  in 
the  HMM-based  estimation  algorithm.  Thus,  each  successive  point  on  the  curve  explicitly 
indicates  the  marginal  value  of  adding  the  corresponding  number  of  states  to  the  model. 


5.5  Extensions  of  the  Basic  Signal  Estimation  Algorithm 

In  the  previous  section,  we  addressed  only  the  problem  of  estimating  a  stationary  Gaussian 
signal  that  has  been  corrupted  by  independent  additive  white  Gaussian  noise.  We  chose  to 
consider  this  classical  estimation  problem  first  not  only  because  of  its  analytical  simplicity, 
but  also  because  a  theoretical  bound  on  performance  was  available  for  this  case.  Specifically, 
the  performance  of  the  globally  optimal  solution  —  i.e.,  the  Wiener  smoother  —  could  be 
calculated  in  advance  and  compared  directly  to  the  performance  achieved  by  using  various 
finite-state  HMMs  for  the  underlying  signal  and  noise  combination.  Clearly,  the  recursive 
algorithms  we  used  to  solve  this  purely  Gaussian  problem  could  be  applied  just  as  easily 
to  the  problem  of  estimating  a  non-Gaussian  signal  in  additive  white  Gaussian  noise.  The 
only  modification  required  in  such  a  case  would  be  to  replace  the  original  HMM,  which 
was  designed  to  represent  the  Gaussian  signal,  with  a  new  HMM  designed  to  represent  the 
non-Gaussian  signal.  The  sequence  of  computations  subsequently  performed  to  generate 
a  signal  estimate  would  be  exactly  the  same  as  before.  Thus,  we  already  have  at  our 
disposal  a  method  for  estimating  any  stationary  signal  (provided,  of  course,  that  the  signal 
is  adequately  characterized  by  an  HMM)  —  either  Gaussian  or  non-Gaussian  —  that  has 


error  signal  that  is  approximately  the  same  as  the  original  signal,  which  in  turn  causes  the  output  SNR  to 
be  approximately  0  dB,  even  though  the  input  SNR  was  -10  dB.  Therefore,  because  the  estimation  gains 
will  usually  appear  to  be  substantial  at  very  small  input  SNR  levels,  the  output  SNR  must  be  interpreted 
with care. At moderate to high input SNR levels, the difference $\mathrm{SNR}_{\mathrm{out}} - \mathrm{SNR}_{\mathrm{in}}$ can be interpreted more
directly  as  a  reduction  in  the  original  noise  power. 

5  We  will  present  examples  later  in  the  chapter  suggesting  that  this  same  conclusion  may  extend  to  more 
complicated  non-Gaussian  estimation  problems  as  well. 
Figure 5-3: Performance comparison among different HMM-based estimators for the AR
Gaussian process $Y_t = 0.75Y_{t-1} + 0.2Y_{t-2} + W_t$ in various levels of additive Gaussian noise:
(a)  output  SNR  as  a  function  of  input  SNR  for  each  of  the  HMMs  used;  (b)  output  SNR 
as  a  function  of  the  number  of  states  in  the  HMM  for  each  noise  level. 
been additively combined with white Gaussian noise.

A  problem  we  have  not  yet  discussed,  however,  is  that  of  estimating  a  stationary  signal 
in  the  presence  of  additive  non-Gaussian  noise.  For  this  more  challenging  problem,  there 
are  two  fundamental  variations  to  be  considered,  according  to  whether  the  samples  of  the 
noise  are  temporally  independent  or  temporally  dependent.  These  two  variations  of  the 
problem  —  which  will  be  addressed,  respectively,  in  the  next  two  subsections  —  lead  to 
modifications  of  the  basic  estimation  algorithm  that  have  different  degrees  of  complexity 
and,  accordingly,  that  require  different  amounts  of  computation  per  output  sample.  When 
addressing  either  of  these  variations,  we  shall  restrict  the  scope  of  the  estimation  problem 
in  the  usual  way  by  assuming  that  any  non-Gaussian  density  characterizing  either  signal  or 
noise  can  be  adequately  modeled  by  a  Gaussian-mixture  density,  provided  that  the  mixture 
contains  a  sufficient  (but  finite)  number  of  elements. 

After  we  have  considered  these  two  main  variations  of  the  estimation  problem  in  some 
detail,  we  then  show,  in  the  third  subsection,  how  the  corresponding  algorithms  can  be 
extended  even  further  by  applying  them  to  the  related  problem  of  signal  separation  —  i.e., 
the  reconstruction  of  many  individual  signals  of  interest  that  have  been  additively  combined 
(and  perhaps  also  corrupted  by  additive  noise). 

5.5.1  Estimating  a  Signal  in  Additive  White  Non-Gaussian  Noise 

We  begin  our  discussion  by  demonstrating  how  to  extend  the  previously  derived  estimation 
algorithm  to  handle  the  case  in  which  the  samples  of  additive  non-Gaussian  noise  are  i.i.d. 
As  before,  we  assume  that  the  signal  and  noise  are  independent,  and  that  the  tth  element 
in our sequence of observations $\{Z_t\}$ is given by

$$Z_t = Y_t + V_t. \tag{5.31}$$

Here, the sequence $\{Y_t\}$ once again represents the output of a stationary $L$-state HMM, but
the sequence $\{V_t\}$ consists of i.i.d. random variables, each now having a Gaussian-mixture
pdf $g'_1(\cdot)$ defined by

$$g'_1(v) = \sum_{j=1}^{M'} \rho'_{1j}\, \mathcal{N}(v;\, \mu'_{1j},\, \sigma'_{1j}). \tag{5.32}$$

As  we  will  soon  see,  even  though  this  modest  change  in  the  density  of  the  noise  is  the 
only  modification  to  the  observation  model  considered  earlier,  it  leads  to  a  substantially 
more  complex  estimator  than  the  one  we  derived  in  the  Gaussian-noise  case.  The  added 
complexity  stems  from  the  need  to  keep  track  of  a  new  discrete-valued  state  variable  at 
each  time  t  that  indicates  which  of  the  M'  Gaussian  components  in  the  above  Gaussian- 
mixture  pdf  was  the  true  density  for  the  noise  sample  added  at  time  t.  Recall  that  a  discrete 
variable  of  this  kind  was  introduced  earlier  during  the  development  of  the  original  estimation 
algorithm; specifically, we used the sub-state variable $\Phi_t$ to indicate which component of the
Gaussian-mixture pdf (associated with the current state of the Markov chain) was selected
to generate the signal value at time $t$. We now introduce an analogous variable $\Phi'_t$ to indicate
which component of the Gaussian-mixture pdf for the noise was selected at time $t$.

With  these  state  variables  defined,  we  can  develop  a  new  estimation  formula  similar  to 
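the one developed for the Gaussian-noise case. Specifically, through appropriate conditioning
on the potential outcomes for the state variables of the model, we may write

\begin{align}
E\{Y_t \mid Z_{0:N-1} = z_{0:N-1}\}
&= \sum_{i=1}^{L} \Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\}\, E\{Y_t \mid \Theta_t = i, Z_t = z_t\} \tag{5.33} \\
&= \sum_{i=1}^{L} \Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\} \cdot
   \sum_{j=1}^{M_i} \sum_{j'=1}^{M'} \Pr\{\Phi_t = j, \Phi'_t = j' \mid \Theta_t = i, Z_t = z_t\} \cdot \notag \\
&\qquad E\{Y_t \mid \Theta_t = i, \Phi_t = j, \Phi'_t = j', Z_t = z_t\}. \tag{5.34}
\end{align}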

We see from this last expression that there are three basic types of quantities that must
be computed in order to produce the desired estimate. In particular, we need to compute
posterior probabilities of the form $\Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\}$ (appearing in the
outer summation), posterior probabilities of the form $\Pr\{\Phi_t = j, \Phi'_t = j' \mid \Theta_t = i, Z_t = z_t\}$
(appearing in the inner summation), and conditional expectations of the form
$E\{Y_t \mid \Phi_t = j, \Phi'_t = j', \Theta_t = i, Z_t = z_t\}$. Once we have shown how each of these quantities is
computed, our algorithm for estimating a stationary signal in white non-Gaussian noise will be
completely specified.

Let  us  first  consider  the  conditional  expectation  appearing  in  (5.34).  Note  that,  under 
the  conditions  assumed  for  this  expectation,  the  random  variable  Yt  has  a  purely  Gaussian 
pdf.  The  parameters  of  this  pdf  will  naturally  depend  on  the  values  given  for  the  state 
variables $\Theta_t$, $\Phi_t$, and $\Phi'_t$, as well as for the observed variable $z_t$; the conditional expectation
itself will be affine in $z_t$. It is straightforward to show that this expectation is given by

$$E\{Y_t \mid \Theta_t = i, \Phi_t = j, \Phi'_t = j', Z_t = z_t\} = \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_{ij}^2 + {\sigma'_{1j'}}^2}\, (z_t - \mu_{ij} - \mu'_{1j'}). \tag{5.35}$$
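The conditional expectation in (5.35) is a scalar Gaussian update, and can be written as a one-line function. The Python sketch below is a minimal illustration; the function and argument names are hypothetical, and standard deviations (rather than variances) are passed in to match the convention of the Gaussian-mixture notation used here.

```python
def component_conditional_mean(z, mu_signal, sd_signal, mu_noise, sd_noise):
    """Mean of Y_t given Z_t = z when the signal and noise samples are drawn from
    single, known Gaussian components (the scalar update in (5.35))."""
    gain = sd_signal ** 2 / (sd_signal ** 2 + sd_noise ** 2)
    return mu_signal + gain * (z - mu_signal - mu_noise)

# Hypothetical values: equal signal and noise variances split the innovation evenly.
example = component_conditional_mean(z=3.0, mu_signal=1.0, sd_signal=1.0,
                                     mu_noise=0.0, sd_noise=1.0)
```

When the noise component has a much smaller variance than the signal component, the gain approaches one and the estimate approaches the observation minus the noise mean; in the opposite extreme it approaches the signal component mean.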


To  describe  how  the  remaining  posterior  probabilities  are  computed,  we  first  reiterate 
a  key  observation  about  the  sequence  {Zt}  that  we  made  in  our  earlier  discussion.  In 
particular, this sequence, like the sequence $\{Y_t\}$, is the output of a stationary $L$-state HMM.
In fact, the parameters defining the underlying Markov chain for $\{Z_t\}$ (i.e., the initial state
probabilities and the state transition probabilities) are exactly the same as those for $\{Y_t\}$.
The only difference between the HMM for $\{Z_t\}$ and the HMM for $\{Y_t\}$ lies in the output
densities associated with the states of their respective Markov chains. If the Markov chain
for $\{Y_t\}$ is in state $i$ at time $t$, the conditional pdf characterizing the output sample $Y_t$ is
given by

$$g_i(y) = \sum_{j=1}^{M_i} \rho_{ij}\, \mathcal{N}(y;\, \mu_{ij},\, \sigma_{ij}). \tag{5.36}$$

Since the signal and noise are assumed statistically independent, we can derive the corresponding
conditional pdf for $Z_t$ simply by convolving the two pdfs $g_i(\cdot)$ and $g'_1(\cdot)$. For the
previously considered case in which the noise was purely Gaussian, this convolution resulted
only in a modification of the variance of each component in the original Gaussian-mixture
pdf for the signal; consequently, the overall number of components in the resulting Gaussian
mixture did not change. However, because the pdf for the noise is now also a Gaussian
mixture (which in general has more than a single component), the convolution procedure
could greatly increase the number of components in the conditional pdf for $Z_t$. In fact, it
can be readily verified that the result of the convolution in this case is given by

$$h_i(z) = \sum_{j=1}^{M_i} \sum_{j'=1}^{M'} \rho_{ij}\, \rho'_{1j'}\, \mathcal{N}\!\left(z;\, \mu_{ij} + \mu'_{1j'},\, \sqrt{\sigma_{ij}^2 + {\sigma'_{1j'}}^2}\right), \tag{5.37}$$

which is a weighted sum of $M_i M'$ Gaussian components. But note that, in spite of the
$M'$-fold increase in the number of parameters needed to describe the conditional output
pdf, the pdf itself is still merely a Gaussian mixture with a finite number of constituent
elements. Therefore, the fundamental mathematical structure of the observed signal $\{Z_t\}$ in
the non-Gaussian-noise case is identical to that of the observed signal in the Gaussian-noise
case; specifically, either signal is the output of an HMM defined such that the conditional
pdf associated with each state of its underlying Markov chain is a Gaussian-mixture pdf.
Moreover, in view of the foregoing description of $\{Z_t\}$, it is clear that the values of all
parameters defining this structure can be easily calculated in advance once $\{Y_t\}$ and $\{V_t\}$
are specified.
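The convolution of two Gaussian mixtures described above can be sketched directly. In the Python example below, each mixture is represented as a list of (weight, mean, standard deviation) triples; this representation and the function name are assumptions made for illustration.

```python
import math

def convolve_mixtures(signal_mix, noise_mix):
    """Pdf of the sum of two independent Gaussian-mixture random variables.
    Each mixture is a list of (weight, mean, standard deviation) triples; the
    result has one component per (signal, noise) component pair."""
    return [(w * wn, m + mn, math.sqrt(s * s + sn * sn))
            for (w, m, s) in signal_mix
            for (wn, mn, sn) in noise_mix]

# Hypothetical two-component signal density and two-component noise density.
signal_mix = [(0.3, -1.0, 0.5), (0.7, 2.0, 1.0)]
noise_mix = [(0.6, 0.0, 1.0), (0.4, 5.0, 2.0)]
observed_mix = convolve_mixtures(signal_mix, noise_mix)
```

The resulting list has four components, one for each pairing, and its weights still sum to one, so it can be used wherever a per-state output density was used in the Gaussian-noise case.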


This  is  a  significant  observation  because  it  means  that  precisely  the  same  algorithms 
that  were  used  in  the  Gaussian-noise  case  can  be  used  once  again  to  compute  all  of 
the  required  posterior  state  probabilities.  In  particular,  we  can  calculate  the  probability 
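$\Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\}$ by applying the recursions already developed for the forward
and backward variables $\alpha_t(i)$ and $\beta_t(i)$, and subsequently by combining the resulting values
to form the equivalent quantity $\gamma_t(i)$. Moreover, we can calculate the remaining probability
$\Pr\{\Phi_t = j, \Phi'_t = j' \mid \Theta_t = i, Z_t = z_t\}$ through a direct application of Bayes' rule, which in
this case is given by

\begin{align}
\Pr\{\Phi_t = j, \Phi'_t = j' \mid \Theta_t = i, Z_t = z_t\}
&= \frac{\Pr\{\Phi_t = j, \Phi'_t = j' \mid \Theta_t = i\}\, \Pr\{Z_t = z_t \mid \Phi_t = j, \Phi'_t = j', \Theta_t = i\}}
        {\sum_{k=1}^{M_i} \sum_{k'=1}^{M'} \Pr\{\Phi_t = k, \Phi'_t = k' \mid \Theta_t = i\}\, \Pr\{Z_t = z_t \mid \Phi_t = k, \Phi'_t = k', \Theta_t = i\}} \notag \\
&= \frac{\rho_{ij}\, \rho'_{1j'}\, \mathcal{N}\!\left(z_t;\, \mu_{ij} + \mu'_{1j'},\, \sqrt{\sigma_{ij}^2 + {\sigma'_{1j'}}^2}\right)}
        {\sum_{k=1}^{M_i} \sum_{k'=1}^{M'} \rho_{ik}\, \rho'_{1k'}\, \mathcal{N}\!\left(z_t;\, \mu_{ik} + \mu'_{1k'},\, \sqrt{\sigma_{ik}^2 + {\sigma'_{1k'}}^2}\right)}. \tag{5.38}
\end{align}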
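To make the role of these component posteriors concrete, the following Python sketch evaluates them for a single Markov-chain state and then combines them with the component-wise conditional means to form the inner double sum of (5.34). The data layout (lists of weight, mean, standard deviation triples) and all function names are assumptions made for this example.

```python
import math

def normal_pdf(x, mean, sd):
    """Gaussian density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))

def component_posteriors(z, signal_mix, noise_mix):
    """Posterior probability of each (signal, noise) component pair given Z_t = z,
    for one fixed Markov-chain state (Bayes' rule, as in (5.38))."""
    joint = {}
    for j, (w, m, s) in enumerate(signal_mix):
        for jp, (wn, mn, sn) in enumerate(noise_mix):
            joint[(j, jp)] = w * wn * normal_pdf(z, m + mn, math.sqrt(s * s + sn * sn))
    total = sum(joint.values())
    return {pair: value / total for pair, value in joint.items()}

def state_conditional_mean(z, signal_mix, noise_mix):
    """E{Y_t | state, Z_t = z}: posterior-weighted sum of the per-pair
    conditional means, i.e. the inner double sum of (5.34)."""
    post = component_posteriors(z, signal_mix, noise_mix)
    mean = 0.0
    for (j, jp), p in post.items():
        _, m, s = signal_mix[j]
        _, mn, sn = noise_mix[jp]
        gain = s * s / (s * s + sn * sn)
        mean += p * (m + gain * (z - m - mn))
    return mean

# Hypothetical symmetric signal density and a single-component noise density.
signal_mix = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
noise_mix = [(1.0, 0.0, 1.0)]
posteriors_at_zero = component_posteriors(0.0, signal_mix, noise_mix)
```

In this symmetric example an observation of zero leaves both signal components equally likely, whereas a large positive observation shifts nearly all posterior weight to the component centered at +2.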

With  these  rather  straightforward  modifications  to  the  original  estimation  algorithm,  we  are 
now  equipped  to  handle  the  case  in  which  the  added  noise  is  non-Gaussian  and  white.  Note, 
however,  that  in  return  for  this  added  capability  we  must  pay  a  premium  in  the  form  of  an 
increase in computation. To see this, suppose for convenience that all of the $M_i$ are identical,
i.e., that $M_i = M$ for $i = 1, 2, \ldots, L$, and recall from our earlier computational analysis that
the required number of operations in the Gaussian-noise case was approximately $cML^2$ per
sample. Now, since the number of components in each output pdf is $MM'$, rather than $M$,
the computational expense increases directly by a factor of $M'$ to approximately $cMM'L^2$
operations per sample.

5.5.2  Estimating  a  Signal  in  Additive  Colored  Non-Gaussian  Noise 

We  now  turn  our  attention  to  the  case  in  which  the  additive  noise  is  not  only  non-Gaussian, 
but  also  colored  (i.e.,  consists  of  samples  that  are  temporally  dependent).  For  this  case, 
we assume that the corrupting sequence $\{V_t\}$ possesses a probabilistic structure similar
to that of the signal $\{Y_t\}$, i.e., it is the output of a stationary finite-state HMM. More
specifically, we assume that at each time $t$, the Markov chain $\{\Theta'_t\}$ associated with the
corrupting sequence can be in any of the $L'$ possible states $\{1, 2, \ldots, L'\}$, and we denote the
initial state probabilities and state transition probabilities for this chain by $\{P'(i)\}_{i=1}^{L'}$ and
$\{Q'(i, j)\}_{i,j=1}^{L'}$, respectively. Furthermore, we assume that the output pdf associated with
the $i$th state of the Markov chain is a Gaussian mixture having $M'_i$ constituent elements,
and we express this pdf as

$$g'_i(v) = \sum_{j=1}^{M'_i} \rho'_{ij}\, \mathcal{N}(v;\, \mu'_{ij},\, \sigma'_{ij}), \qquad i = 1, 2, \ldots, L'. \tag{5.39}$$

It is understood that the $L'$ output densities specified above remain constant for each time
t.  All  parameters  defining  both  the  HMM  for  the  noise  and  the  HMM  for  the  signal  are 
assumed  known. 

Because  the  noise  sequence  now  has  temporal  dynamics  induced  by  its  underlying 
Markov  chain,  the  optimal  estimation  formula  for  this  case  includes  an  additional  layer 
of  complexity  that  was  not  present  in  the  previously  considered  white-noise  case.  To  derive 
the new formula, we will use the variables $\Theta_t$ and $\Theta'_t$ to indicate the states of the respective
Markov chains for the signal and noise at time $t$, and, just as before, we will use the variables
$\Phi_t$ and $\Phi'_t$ to indicate the components of the respective Gaussian-mixture densities
that were selected to generate the signal and noise values at time $t$. Once again, through
appropriate conditioning on the potential outcomes for these state variables of the model,
we may write

\begin{align}
E\{Y_t \mid Z_{0:N-1} = z_{0:N-1}\}
&= \sum_{i=1}^{L} \sum_{i'=1}^{L'} \Pr\{\Theta_t = i, \Theta'_t = i' \mid Z_{0:N-1} = z_{0:N-1}\} \cdot
   E\{Y_t \mid \Theta_t = i, \Theta'_t = i', Z_t = z_t\} \tag{5.40} \\
&= \sum_{i=1}^{L} \sum_{i'=1}^{L'} \Pr\{\Theta_t = i, \Theta'_t = i' \mid Z_{0:N-1} = z_{0:N-1}\} \cdot \notag \\
&\qquad \sum_{j=1}^{M_i} \sum_{j'=1}^{M'_{i'}} \Pr\{\Phi_t = j, \Phi'_t = j' \mid \Theta_t = i, \Theta'_t = i', Z_t = z_t\} \cdot \notag \\
&\qquad E\{Y_t \mid \Theta_t = i, \Theta'_t = i', \Phi_t = j, \Phi'_t = j', Z_t = z_t\}. \tag{5.41}
\end{align}

Note  that  this  last  expression  has  the  same  basic  form  as  (5.34),  except  that  we  have  now 
incorporated the additional state variable $\Theta'_t$ into both the posterior probabilities and the
conditional expectation. As in the previous case, the expectation in this expression is easy
to compute, since the random variable $Y_t$ has a Gaussian pdf when conditioned on the
values of the model variables $\Theta_t$, $\Theta'_t$, $\Phi_t$, $\Phi'_t$, and $Z_t$. In this case, the expectation takes the
slightly different form

$$E\{Y_t \mid \Theta_t = i, \Theta'_t = i', \Phi_t = j, \Phi'_t = j', Z_t = z_t\} = \mu_{ij} + \frac{\sigma_{ij}^2}{\sigma_{ij}^2 + {\sigma'_{i'j'}}^2}\, (z_t - \mu_{ij} - \mu'_{i'j'}). \tag{5.42}$$

To compute the posterior probabilities appearing in (5.41), we can use the same techniques
developed earlier, but we must first establish that the observed sequence $\{Z_t\}$ is again the
output of a finite-state HMM (albeit one with many more possible states than the one derived
in the white-noise case just considered). To see that $\{Z_t\}$ can indeed be characterized
in this way, first observe that at a given time $t$, the values of $\Theta_t$ and $\Theta'_t$ provide current and
complete descriptions of the individual processes $\{Y_t\}$ and $\{V_t\}$, respectively, in the sense
that no further information could be provided about either process that would improve our
predictions about its future behavior. Since $\{Z_t\}$ is merely the sum of $\{Y_t\}$ and $\{V_t\}$, it
follows that the pair of values $(\Theta_t, \Theta'_t)$ also summarizes all relevant information currently
available about the underlying dynamics of $\{Z_t\}$, and therefore may be considered a suitable
state variable for $\{Z_t\}$. Now, since the individual signal state variable $\Theta_t$ may assume any of
the $L$ values $\{1, 2, \ldots, L\}$, and the noise state variable $\Theta'_t$ may simultaneously assume any
of the $L'$ values $\{1, 2, \ldots, L'\}$, we conclude that the new composite state variable $(\Theta_t, \Theta'_t)$
may assume any of the $LL'$ values $\{(1, 1), (1, 2), \ldots, (L, L')\}$.


Furthermore, based on the parameter specifications for the individual Markov chains
$\{\Theta_t\}$ and $\{\Theta'_t\}$, as well as on the assumption that these two chains are statistically independent,
we can directly calculate the parameter values that characterize the new Markov
“super-chain” $\{(\Theta_t, \Theta'_t)\}$. In particular, the initial state probabilities for this super-chain,
which we denote by $\{P''((i, i'))\}$, are given by

\begin{align}
P''((i, i')) &= \Pr\{\Theta_0 = i, \Theta'_0 = i'\} \tag{5.43} \\
&= \Pr\{\Theta_0 = i\}\, \Pr\{\Theta'_0 = i' \mid \Theta_0 = i\} \tag{5.44} \\
&= \Pr\{\Theta_0 = i\}\, \Pr\{\Theta'_0 = i'\} \tag{5.45} \\
&= P(i)\, P'(i'). \tag{5.46}
\end{align}

The state transition probabilities, which we denote by $\{Q''((i, i'), (j, j'))\}$, are given by

\begin{align}
Q''((i, i'), (j, j')) &= \Pr\{\Theta_{t+1} = j, \Theta'_{t+1} = j' \mid \Theta_t = i, \Theta'_t = i'\} \tag{5.47} \\
&= \Pr\{\Theta_{t+1} = j \mid \Theta_t = i, \Theta'_t = i'\} \cdot
   \Pr\{\Theta'_{t+1} = j' \mid \Theta'_t = i', \Theta_t = i, \Theta_{t+1} = j\} \tag{5.48} \\
&= \Pr\{\Theta_{t+1} = j \mid \Theta_t = i\}\, \Pr\{\Theta'_{t+1} = j' \mid \Theta'_t = i'\} \tag{5.49} \\
&= Q(i, j)\, Q'(i', j'). \tag{5.50}
\end{align}


Note  that  in  each  of  the  above  derivations,  we  have  used  the  fact  that  the  original  Markov 
chains  for  the  signal  and  noise  are  statistically  independent. 
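The construction of the super-chain parameters amounts to taking products over all pairs of states. The Python sketch below is one way to write it, assuming each chain is given as a list of initial probabilities and a row-stochastic transition matrix (nested lists) with zero-based state indices; the function name and the numerical values are hypothetical.

```python
def super_chain(init_signal, trans_signal, init_noise, trans_noise):
    """Initial and transition probabilities of the composite chain formed from two
    independent Markov chains, following the product forms (5.46) and (5.50)."""
    states = [(i, ip) for i in range(len(init_signal)) for ip in range(len(init_noise))]
    init = {(i, ip): init_signal[i] * init_noise[ip] for (i, ip) in states}
    trans = {((i, ip), (j, jp)): trans_signal[i][j] * trans_noise[ip][jp]
             for (i, ip) in states for (j, jp) in states}
    return states, init, trans

# Hypothetical two-state signal chain and three-state noise chain.
init_s = [0.6, 0.4]
trans_s = [[0.9, 0.1], [0.2, 0.8]]
init_n = [0.5, 0.3, 0.2]
trans_n = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
states, init, trans = super_chain(init_s, trans_s, init_n, trans_n)
```

With these inputs the composite chain has six states, and each row of its transition table sums to one because each factor does.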

The  output  density  associated  with  state  (i,  i')  of  the  new  Markov  super-chain  can  be 
obtained, as before, by convolving the corresponding signal and noise densities $g_i(\cdot)$ and
$g'_{i'}(\cdot)$. Performing this convolution yields the new pdf

$$h_{(i,i')}(z) = \sum_{j=1}^{M_i} \sum_{j'=1}^{M'_{i'}} \rho_{ij}\, \rho'_{i'j'}\, \mathcal{N}\!\left(z;\, \mu_{ij} + \mu'_{i'j'},\, \sqrt{\sigma_{ij}^2 + {\sigma'_{i'j'}}^2}\right). \tag{5.51}$$

This completes the parameter specification of the new HMM for the observed signal $\{Z_t\}$.
Clearly, the required posterior probabilities $\Pr\{\Theta_t = i, \Theta'_t = i' \mid Z_{0:N-1} = z_{0:N-1}\}$ and
$\Pr\{\Phi_t = j, \Phi'_t = j' \mid \Theta_t = i, \Theta'_t = i', Z_t = z_t\}$ can now be computed exactly as they were
in the previously considered non-Gaussian-white-noise case (i.e., through the use of standard
recursions on the forward and backward variables $\alpha_t(\cdot)$ and $\beta_t(\cdot)$, and through the
application of Bayes' rule, respectively).

However,  we  once  again  must  pay  a  substantial  premium  in  computation,  this  time  for 
the added capability of handling temporally dependent non-Gaussian noise. To determine
how much additional computation is needed, let us first assume for convenience that $M_i = M$
for $i = 1, 2, \ldots, L$, and that $M'_i = M'$ for $i = 1, 2, \ldots, L'$. Also, let us recall from
our earlier computational analysis that the required number of operations in the
Gaussian-white-noise case was approximately $cML^2$ per sample. Now, since the number of states in
the underlying Markov chain is $LL'$, rather than $L$, and the number of components in each
output pdf is $MM'$, rather than $M$, the computational expense has grown to approximately
$cMM'L^2L'^2$ operations per sample. Thus, for this case, we see that total computation
increases by a factor of $M'L'^2$ over the original Gaussian-white-noise case, and by a factor
of $L'^2$ over the non-Gaussian-white-noise case.
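The three operation counts quoted in this section can be compared with a small helper. The Python sketch below simply evaluates the approximate per-sample cost as (components per output pdf) times (number of states) squared; the function name, the default constant c = 1, and the example sizes are assumptions for illustration.

```python
def ops_per_sample(M, L, M_noise=1, L_noise=1, c=1.0):
    """Approximate per-sample operation count: c * (components per output pdf)
    * (number of Markov-chain states)^2."""
    return c * (M * M_noise) * (L * L_noise) ** 2

# Hypothetical model sizes.
M, L, M_noise, L_noise = 3, 5, 4, 2
gaussian_white = ops_per_sample(M, L)                         # c * M * L^2
non_gaussian_white = ops_per_sample(M, L, M_noise)            # c * M * M' * L^2
non_gaussian_colored = ops_per_sample(M, L, M_noise, L_noise) # c * M * M' * L^2 * L'^2
```

For these sizes the colored-noise cost is M' L'^2 = 16 times the Gaussian-white-noise cost and L'^2 = 4 times the non-Gaussian-white-noise cost.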


5.5.3  Separating  Multiple  Linearly  Combined  Non-Gaussian  Signals 

In  the  preceding  sections,  we  have  addressed  the  problem  of  estimating  a  stationary  signal 
of  interest  that  has  been  corrupted  by  additive  stationary  noise,  and  we  have  developed 
extensions  of  our  basic  estimation  technique  in  order  to  deal  with  increasingly  complex 
signal  and  noise  waveforms.  In  this  section,  we  extend  our  basic  technique  even  further  by 
demonstrating  that  it  can  be  applied  to  the  more  general  problem  of  signal  separation,  i.e., 
the  problem  in  which  multiple  signals  of  interest  have  been  added  together  and  corrupted 
by  noise,  and  in  which  each  individual  signal  must  be  optimally  recovered  from  a  single 
finite-length  observation. 

We  have  seen  in  earlier  derivations  of  our  estimation  technique  that,  under  the  HMM- 
based  formulation  in  which  each  output  density  is  a  Gaussian  mixture,  the  estimation 
of  an  entire  waveform  ultimately  reduces  to  solving  numerous  scalar  Gaussian  estimation 
problems at each time t, and then nonlinearly combining these intermediate estimates (with
appropriately  defined  posterior  probabilities)  to  produce  the  final  signal  estimate  at  time 
t.  Remarkably,  this  same  basic  sequence  of  operations  (with  only  slight  modifications)  may 
also  be  used  to  solve  the  more  general  problem  in  which  the  values  from  many  signals, 
rather  than  just  one,  have  been  additively  combined  with  noise  at  each  sample. 

Before  delving  into  a  demonstration  of  an  HMM-based  signal  separation  technique,  we 
first  briefly  summarize  the  main  idea  behind  it.  Note  that  this  new  nonlinear  technique  rests 
on  the  assumption  that  all  of  the  signal  and  noise  waveforms  contained  in  the  observation 
are  mutually  independent.  With  this  assumption  in  mind,  let  us  consider  the  problem  of 
optimally  extracting  only  one  of  the  many  signals  of  interest  included  in  the  observed  sum. 
We  can  generate  an  estimate  of  the  desired  signal  using  the  tools  we  have  already  developed, 
once  we  recognize  that  all  other  processes  contained  in  the  observation  —  both  signal  and 
noise  —  can  be  effectively  lumped  together  and  viewed  as  a  single,  monolithic  noise  process 
that  is  to  be  filtered  out.  Of  course,  the  parameters  that  characterize  this  newly  defined 
lumped  noise  process  will  vary  according  to  the  identity  of  the  signal  we  are  currently 
attempting  to  estimate.  Nonetheless,  once  the  appropriate  changes  have  been  made  to  the 
definition  of  the  noise  structure,  the  basic  estimation  technique  can  be  applied  exactly  as 
before.  In  this  way,  each  of  the  signals  contained  in  the  observation  can  be  estimated  in 
turn. 
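
To make the lumping idea concrete at the level of a single sample, the sketch below combines the Gaussian-mixture densities of two independent random variables into the Gaussian-mixture density of their sum: component means add, variances add, and weights multiply. It treats marginal densities only and is an assumption-laden illustration; for temporally dependent processes the lumped model must also carry a joint state (pairs of states of the constituent Markov chains), which is not shown here. The tuple layout (mean, std, weight) is hypothetical.

```python
import math
from itertools import product

def lump_mixtures(mix_a, mix_b):
    """Gaussian mixture of A + B for independent A and B.

    Each mixture is a list of (mean, std, weight) tuples.
    """
    lumped = []
    for (ma, sa, wa), (mb, sb, wb) in product(mix_a, mix_b):
        # Sum of independent Gaussians: means add, variances add.
        lumped.append((ma + mb, math.sqrt(sa * sa + sb * sb), wa * wb))
    return lumped
```

For instance, lumping a two-component mixture with a single Gaussian yields a two-component mixture whose components have been broadened by the Gaussian's variance.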

Our  objective  in  the  remainder  of  this  section  is  to  give  a  concrete  example  of  a  non- 
Gaussian  signal  separation  problem,  and  to  compare  the  performance  of  our  HMM-based 
smoothing  technique  in  this  problem  to  the  conventional  Wiener  smoothing  technique.  In 
particular,  we  shall  consider  an  example  in  which  the  given  observation  is  a  sum  of  three 
independent  waveforms,  two  of  which  are  temporally  dependent  non-Gaussian  signals  of 
interest,  and  the  other  of  which  is  white  Gaussian  noise. 

Let  us  begin  by  specifying  the  statistical  structure  of  the  two  signals  of  interest  contained 
in  the  observation.  The  first  of  these  signals,  which  we  denote  by  {Xt},  is  defined  to  be  the 
output  of  a  nonlinear  autoregressive  system  driven  by  white  noise.  Specifically,  we  assume 
that {Xt} obeys the difference equation (see footnote 6)

    X_{t+1} = 0.5 X_t + \frac{25 X_t}{1 + X_t^2} + 8 \cos(1.2 t) + W_t,    (5.52)

where  the  elements  of  the  sequence  {Wt}  are  i.i.d.  Gaussian  random  variables,  each  having 
a  mean  of  zero  and  a  standard  deviation  of  unity.  A  finite-length  realization  of  {Xt} 
is  depicted  in  Figure  5-4(a).  Also,  in  Figure  5-4(b),  we  show  a  scatter  plot  of  pairs  of 
consecutive samples {(xt-1, xt)}, which have been constructed directly from the realization
shown  in  Figure  5-4(a).  This  scatter  plot  reveals  the  unusual  nature  of  the  probability 
distribution  that  characterizes  the  state  variable  of  the  nonlinear  system  described  above. 
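
A realization of this kind, together with the pairs used in the scatter plot, can be generated with a few lines of code. The sketch below assumes the form of the recursion commonly used for this benchmark process, with nonlinear term 25x/(1 + x^2); the initial condition x0 = 0 and the function names are assumptions made only for illustration.

```python
import math
import random

def simulate_nonlinear_ar(n, seed=0, x0=0.0):
    """Generate n samples of the nonlinear autoregression (assumed benchmark form)."""
    rng = random.Random(seed)
    x = [x0]  # assumed initial condition
    for t in range(n - 1):
        w = rng.gauss(0.0, 1.0)  # unit-variance white Gaussian driving noise
        x.append(0.5 * x[t] + 25.0 * x[t] / (1.0 + x[t] ** 2)
                 + 8.0 * math.cos(1.2 * t) + w)
    return x

def consecutive_pairs(x):
    """Pairs (x[t-1], x[t]) of consecutive samples, as used for a scatter plot."""
    return list(zip(x[:-1], x[1:]))
```

Calling `consecutive_pairs(simulate_nonlinear_ar(300))` yields the 299 points that such a scatter plot would display.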

The second signal of interest, which we denote by {Yt}, is defined to be a discrete-time
version of the classical telegraph signal, which switches back and forth between two distinct
values according to a simple probabilistic rule. Specifically, we assume that the values taken
by {Yt} are -20 and +20, and that consecutive samples of this signal obey the symmetric
Markovian  probability  laws 


Pr{Yt+1 = +20|Yt = +20} = Pr{Yt+1 = -20|Yt = -20} = 0.97    (5.53)

and

Pr{Yt+1 = +20|Yt = -20} = Pr{Yt+1 = -20|Yt = +20} = 0.03    (5.54)

for  all  integer  values  of  t.  For  this  example,  we  assume  the  telegraph  signal  is  initialized  at 
time  t  =  0  according  to 


Pr{  Y0  =  +20}  =  Pr{Y0  =  -20}  =  0.5.  (5.55) 
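
The probabilistic rule in (5.53) through (5.55) translates directly into a short simulation. The following is a minimal sketch; the function name and the use of a seeded pseudo-random generator are choices made here for illustration.

```python
import random

def simulate_telegraph(n, level=20.0, p_stay=0.97, seed=0):
    """Generate n samples of a symmetric two-level Markov (telegraph) signal."""
    rng = random.Random(seed)
    # Initial value: +level or -level with probability 0.5 each, as in (5.55).
    y = [level if rng.random() < 0.5 else -level]
    for _ in range(n - 1):
        stay = rng.random() < p_stay  # keep the current level w.p. 0.97
        y.append(y[-1] if stay else -y[-1])
    return y
```

Over a long realization, the fraction of sample-to-sample sign changes should be close to 0.03.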

Finally,  the  corrupting  noise  sequence  contained  in  the  observation,  which  we  denote  by 
{Vt},  is  assumed  to  consist  of  i.i.d.  Gaussian  random  variables,  each  having  a  mean  of  zero 
and  a  standard  deviation  of  5.0. 

Given  these  specifications  for  the  signal  and  noise  components,  our  problem  is  now  to 
estimate, in the MMSE sense, the particular values taken by the random vectors X0:N-1
and Y0:N-1, based only on the value of the observed vector Z0:N-1, which is defined by

    Z0:N-1 = X0:N-1 + Y0:N-1 + V0:N-1.    (5.56)

If  we  are  to  apply  our  HMM-based  smoothing  technique  to  solve  this  signal  separation 
problem  (at  least  in  an  approximate  sense),  we  must  first  have  appropriate  models  for 
each component of the observation. Fortunately, for the noise waveform {Vt}, an exact,
degenerate  HMM-based  representation  is  readily  available.  In  particular,  the  corrupting 
noise  sequence  can,  in  its  present  form,  be  viewed  as  the  output  of  a  one-state  HMM,  whose 


Footnote 6: This particular random process has appeared frequently in the literature on non-Gaussian
signal estimation. For example, it has been previously used in the work of Netto et al. [131],
Kitagawa [99], and Gordon et al. [68].

Figure 5-4: Plots showing the temporal and statistical character of the non-Gaussian signal
described in text: (a) 300-point realization of signal; (b) scatter plot of pairs {(xt-1, xt)}.
    i    P(i)    Q(i,1)    Q(i,2)
    1    0.50    0.97      0.03
    2    0.50    0.03      0.97

    i    μ(i)     σ(i)    ρ(i)
    1    -20.0    0.1     1.0
    2     20.0    0.1     1.0

Table 5.6: Parameter definitions for 2-state HMM representation of the discrete-time telegraph
signal {Yt} discussed in the text. From top to bottom: specification of initial state
probabilities and state transition probabilities for Markov chain; specification of means,
standard deviations, and weighting coefficients for the (one-component) Gaussian-mixture
pdf associated with each state.


initial state probability and sole self-transition probability are both 1.0, and whose output
density g'_1(·) is defined by


    g'_1(v) = N(v; 0, 5).    (5.57)

The  discrete-time  telegraph  signal  {Yt}  can  also  be  represented  exactly  by  an  HMM, 
provided  that  we  allow  Dirac  delta  functions  in  the  definitions  of  the  output  densities. 
To see this, observe that the underlying Markov chain associated with this HMM could
consist of two states, one for each possible value that can be taken by {Yt}. The initial
state probabilities and state transition probabilities for such a Markov chain can be inferred
directly from the formulas given in (5.53), (5.54), and (5.55). The output densities for the
two states, which we denote by g1(·) and g2(·), would then be defined as

    g1(y) = δ(y - 20)    (5.58)

    g2(y) = δ(y + 20),    (5.59)

where δ(·) is the Dirac delta function. This definition causes some practical difficulty, however,
for we cannot evaluate densities such as those defined above during the implementation
of our HMM-based estimation technique. Instead, we must settle for an approximation to a
translated Dirac delta function, the most convenient of which is a Gaussian pdf having the
same mean value (i.e., either +20 or -20), but with an extremely small standard deviation.
In  Table  5.6,  we  give  a  complete  specification  for  the  HMM  used  to  approximate  the  signal 
{Yt}  in  this  example. 

To  create  an  HMM-based  representation  for  the  more  complicated  non-Gaussian  signal 
{Xt}, we must rely on the model-building methods developed in Chapter 4. For the purposes
of this example, we chose to model {Xt} using a 16-state HMM in which the output
pdf associated with each state was a three-component Gaussian mixture. Furthermore,
although  it  is  clear  from  the  autoregression  in  (5.52)  that  a  scalar  state  variable  would  be 
sufficient  to  describe  the  state  of  the  original  nonlinear  system  at  any  time,  we  chose  to 
use  a  two-dimensional  state  vector  in  order  to  improve  the  accuracy  of  the  rather  coarse 
finite-state  approximation.  For  this  reason,  the  16  states  of  the  underlying  Markov  chain 
actually  represent  16  disjoint,  collectively  exhaustive  regions  in  a  two-dimensional  state 
space.  (Recall  that  such  a  space  was  depicted  in  Figure  5-4(b).)  In  Table  5.7,  we  give  a 
complete  specification  of  the  HMM  used  to  approximate  the  signal  {Xt}  in  this  example. 

Using  the  finite-state  models  just  described  for  the  signals  and  noise,  we  can  now  easily 
perform  signal  separation  by  applying  our  new  nonlinear  HMM-based  estimation  algorithm 
to a specific realization of the random vector Z0:N-1. In order to establish a useful point
of  reference  by  which  we  can  assess  the  resulting  estimation  performance,  we  shall  compare 
the  results  of  our  nonlinear  algorithm  to  those  of  a  conventional  linear  technique,  namely 
the  Wiener  smoother.  Although  the  Wiener  smoother  is  not  a  globally  optimal  MMSE 
estimator  for  this  problem  (owing  to  the  fact  that  the  observation  contains  non-Gaussian 
signals),  it  is  nonetheless  the  best  possible  linear  estimator  that  we  can  use. 

The Wiener smoother associated with each signal of interest can be implemented through
straightforward matrix-vector multiplication. However, we must first know the values of the
covariance matrices associated with the three constituent random vectors X0:N-1, Y0:N-1,
and V0:N-1 making up the observation. Since each of these random vectors represents
a section of a stationary signal, each associated covariance matrix possesses a Toeplitz
structure; moreover, because the random vectors are all zero-mean, each covariance matrix
is specified entirely by the first N values of the associated autocorrelation function (i.e.,
autocorrelation values ranging from the 0th lag up to and including the (N - 1)th lag).
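
The matrix-vector form of the Wiener smoother can be sketched in a few lines. For zero-mean, mutually independent components, the linear MMSE estimate of one component is its covariance matrix times the inverse of the observation covariance (the sum of all component covariances) times the observation. The code below is a self-contained illustration using plain Gaussian elimination; the helper names are hypothetical, and a practical implementation would more likely rely on a numerical library or on a solver that exploits the Toeplitz structure.

```python
def toeplitz(acf):
    """Symmetric Toeplitz covariance matrix built from autocorrelation lags 0..N-1."""
    n = len(acf)
    return [[acf[abs(i - j)] for j in range(n)] for i in range(n)]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x

def matvec(a, v):
    """Matrix-vector product."""
    return [sum(aij * vj for aij, vj in zip(row, v)) for row in a]

def wiener_smooth(cov_target, cov_list, z):
    """Linear MMSE estimate of one zero-mean component of z.

    cov_list holds the covariance matrices of all independent components of z.
    """
    n = len(z)
    cov_z = [[sum(c[i][j] for c in cov_list) for j in range(n)] for i in range(n)]
    return matvec(cov_target, solve(cov_z, z))
```

Each component of the observation gets its own call to `wiener_smooth`, with the same `cov_list` and a different `cov_target`.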

In light of the definitions given earlier, we see that the kth lag of the autocorrelation
function for the noise waveform {Vt} is given by


    E{Vt Vt+k} = 25 · δ_{0,k},    (5.60)

where δ_{0,k} is the Kronecker delta sequence (i.e., a sequence which has a value of unity if
k = 0, but otherwise is identically zero). It follows that the covariance matrix for the vector
V0:N-1 is just a scaled version of the identity matrix, where the scale factor is the value of
the noise variance at each sample.

The autocorrelation function of the discrete-time telegraph waveform {Yt} can also be
obtained in closed form. In fact, it can be shown (after a significant amount of analysis
of the structure of the underlying Markov chain) that the kth lag of the autocorrelation
function for the telegraph waveform is given by


    E{Yt Yt+k} = 0.94^{|k|} · 400.    (5.61)

Using this formula, we can construct the covariance matrix for Y0:N-1 simply by repeating
the value of the kth lag along both the kth sub-diagonal and the kth super-diagonal of the
N x N matrix, for k = 0, 1, ..., N - 1.
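
The closed form in (5.61) can be checked numerically from the Markov chain itself. With the uniform initial pmf of (5.55) the chain is stationary, and the kth lag equals the sum over state pairs (i, j) of P(i) y_i [Q^k](i, j) y_j, where Q^k is the k-step transition matrix; since the second eigenvalue of Q is 0.97 - 0.03 = 0.94, this sum decays as 0.94^k. The sketch below performs the computation directly; the function names are hypothetical.

```python
def matmul(a, b):
    """Product of two small matrices stored as lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def telegraph_autocorrelation(max_lag, level=20.0, p_stay=0.97):
    """Lags 0..max_lag of E{Y_t Y_{t+k}} computed from the two-state Markov chain."""
    q = [[p_stay, 1.0 - p_stay], [1.0 - p_stay, p_stay]]
    values = [-level, level]
    pmf = [0.5, 0.5]                   # stationary (and initial) state pmf
    q_k = [[1.0, 0.0], [0.0, 1.0]]     # Q^0
    acf = []
    for _ in range(max_lag + 1):
        acf.append(sum(pmf[i] * values[i] * q_k[i][j] * values[j]
                       for i in range(2) for j in range(2)))
        q_k = matmul(q_k, q)           # advance to the next power of Q
    return acf
```

The resulting list can be passed to any routine that fills the diagonals of a Toeplitz matrix from its lags.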

Unfortunately, the mathematical definition of the remaining non-Gaussian signal {Xt}
[Upper portion of table: initial state probabilities and state transition probabilities, with
column headings beginning Q(i,1), ..., Q(i,8); entries not legible in the source.]

     i           μ(i)                     σ(i)                ρ(i)
     1     5.60    3.75    1.76     [?]   [?]   [?]     [?]   0.30  0.46
     2    16.51   12.52   14.20     1.67  0.45  [?]     0.34  0.22  0.44
     3     9.88    8.24   11.09     0.77  0.92  0.64    0.31  0.45  0.23
     4    14.21   16.34   15.02     1.86  0.91  1.08    0.21  0.13  0.67
     5    16.45   12.52   14.62     1.10  1.27  1.07    0.17  0.27  0.56
     6     7.02    6.72    5.09     1.40  1.44  1.02    0.25  0.23  0.52
     7     2.55    1.29    3.38     0.90  1.01  0.60    0.37  0.35  0.27
     8    -0.42   -0.12   -1.72     0.79  1.18  0.54    0.39  0.50  0.11
     9    -5.60   -3.75   -1.76     0.75  0.91  0.91    0.24  0.30  0.46
    10   -16.51  -12.52  -14.20     1.67  0.45  0.93    0.34  0.22  0.44
    11    -9.88   -8.24  -11.09     0.77  0.92  0.64    0.31  0.45  0.23
    12   -14.21  -16.34  -15.02     1.86  0.91  1.08    0.21  0.13  0.67
    13   -16.45  -12.52  -14.62     1.10  1.27  1.07    0.17  0.27  0.56
    14    -7.02   -6.72   -5.09     1.40  1.44  1.02    0.25  0.23  0.52
    15    -2.55   -1.29   -3.38     0.90  1.01  0.60    0.37  0.35  0.27
    16     0.42    0.12    1.72     0.79  1.18  0.54    0.39  0.50  0.11

    ([?] marks an entry that is illegible in the source.)

Table 5.7: Parameter definitions for 16-state HMM representation of the non-Gaussian
signal {Xt} discussed in the text. From top to bottom: specification of initial state probabilities
and state transition probabilities for Markov chain; specification of means, standard
deviations, and weighting coefficients for Gaussian-mixture pdf associated with each state.
makes it extremely difficult to solve for the covariance matrix of X0:N-1 in closed form.
Thus, for this signal we resort to the numerical technique of computing the sample autocorrelation
function based on a very long realization of {Xt} (specifically, a realization having
a length of 100,000 samples); we can then construct the corresponding covariance matrix
just as we did before, by specifying its diagonals one at a time based on the lags of the
autocorrelation function.
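
A sample autocorrelation function of the kind described here might be computed as in the sketch below. The normalization is an assumption: the text does not say whether each lag is divided by the full record length or by the number of products actually summed. The code divides by the full length (the so-called biased estimate), which has the practical advantage of always producing a positive semidefinite Toeplitz matrix.

```python
def sample_autocorrelation(x, max_lag):
    """Sample autocorrelation of a zero-mean sequence for lags 0..max_lag.

    Uses the biased normalization 1/len(x) (an assumption made for this sketch).
    """
    n = len(x)
    return [sum(x[t] * x[t + k] for t in range(n - k)) / n
            for k in range(max_lag + 1)]
```

For an observation length N, only lags 0 through N - 1 are needed, so `max_lag` would be set to N - 1.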

Having  defined  all  of  the  necessary  elements  for  implementing  both  the  Wiener  smoother 
and  the  HMM-based  smoother,  let  us  now  consider  performing  signal  separation  using  a 
specific realization of the observed vector Z0:N-1. In Figures 5-5(a) through 5-5(d), we
show, respectively, realizations of the individual vectors X0:N-1, Y0:N-1, and V0:N-1, and
their sum Z0:N-1; here, we have arbitrarily chosen the observation length N = 150. When
the Wiener smoothing technique is applied to the observed waveform shown in Figure 5-5(d),
we obtain the estimated waveforms shown in Figure 5-6. Recall that, because there
are three distinct random processes that make up the observation in this example, there
are, accordingly, three distinct Wiener smoothers that must be applied to the observation
in order to separate these processes. The smoothers that are designed to extract the vectors
X0:N-1, Y0:N-1, and V0:N-1 generate the estimates shown in Figures 5-6(a), 5-6(b), and
5-6(c), respectively. The sum of these estimated waveforms is also shown in Figure 5-6(d).

Let  us  now  compare  these  baseline  estimates  with  the  corresponding  estimates  produced 
by  the  HMM-based  smoothing  technique,  which  are  shown  in  Figures  5-7(a)  through  5-7(d). 
By visually comparing and contrasting the estimated waveforms displayed in Figures 5-6
and 5-7, along with their true original counterparts shown in Figure 5-5, we see that
the  results  of  the  HMM-based  smoothing  technique  appear  to  be  superior  to  those  of  the 
Wiener  smoothing  technique.  In  a  moment,  we  will  provide  quantitative  evidence  supporting 
this  assertion.  Interestingly,  an  attribute  shared  by  both  of  the  signal  separation  methods 
presented  here  is  that  the  sum  of  the  three  estimated  waveforms  is  always  the  same  as 
the  sum  of  three  original  waveforms.  Hence,  the  plots  appearing  in  Figures  5-5(d),  5-6(d), 
and  5-7(d)  are  actually  identical.  This  property  follows  directly  from  the  mathematical 
definitions  of  the  estimates  produced  by  each  technique. 
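
The reasoning behind this property can be written out in two lines. The notation below is introduced only for this illustration: R_X, R_Y, and R_V denote the covariance matrices of the three component vectors, R_Z denotes the covariance matrix of the observation, and the second identity is stated for exact conditional-mean estimates.

```latex
% Wiener smoothers (zero-mean, mutually independent components, R_Z = R_X + R_Y + R_V):
\[
\hat{x} + \hat{y} + \hat{v}
  = (R_X + R_Y + R_V)\, R_Z^{-1} z
  = R_Z R_Z^{-1} z
  = z .
\]

% Conditional-mean (MMSE) estimates, using linearity of conditional expectation:
\[
E\{X \mid Z = z\} + E\{Y \mid Z = z\} + E\{V \mid Z = z\}
  = E\{X + Y + V \mid Z = z\}
  = E\{Z \mid Z = z\}
  = z .
\]
```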

To obtain a more precise numerical characterization of the performance of each smoothing
technique, we repeated the above signal separation experiment a total of 1000 times
using randomly generated observations. On each trial, we recorded the error incurred by
each smoothing technique after estimating each signal of interest from the observation.
(Estimation of the noise waveform was considered unimportant, and hence the results for this
waveform  were  not  examined.)  The  measure  of  performance  used  on  each  trial,  for  each 
signal  of  interest,  was  the  realized  mean  squared  error  (MSE)  value,  i.e.,  the  arithmetic 
average  of  the  150  real  numbers  obtained  by  subtracting  the  actual  waveform  from  the 
estimated  waveform  and  squaring  the  resulting  residual  value  at  each  time  index.  After  all 
1000  trials  had  been  performed,  we  were  left  with  a  collection  of  1000  such  MSE  scores  for 
each  of  four  separate  cases,  representing  the  results  of  applying  each  of  the  two  smoothing 
techniques  to  extract  each  of  the  two  signals  of  interest. 
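
The per-trial score and the summary statistics described above amount to the following computations. This is a minimal sketch; the function names are hypothetical, and the use of the N - 1 normalization in the sample standard deviation is an assumption, since the text does not specify it.

```python
import math

def realized_mse(actual, estimate):
    """Arithmetic average of the squared residuals between estimate and actual."""
    return sum((e - a) ** 2 for a, e in zip(actual, estimate)) / len(actual)

def sample_mean_and_std(scores):
    """Sample mean and sample standard deviation (n - 1 normalization assumed)."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var)
```

In an experiment of this type, `realized_mse` would be called once per trial, per estimator, and per signal of interest, and `sample_mean_and_std` once per case on the resulting list of scores.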

The  sample  mean  and  sample  standard  deviation  of  the  MSE  value  for  each  possible 
case  are  displayed  in  Table  5.8.  Observe  that,  when  the  non-Gaussian  signal  {Xt}  is  being 
estimated,  the  HMM-based  smoother  yields,  on  the  average,  an  MSE  value  that  is  nearly 
Figure 5-5: Plots of constituent waveforms used for signal separation problem: (a) realization
of non-Gaussian signal; (b) realization of discrete-time telegraph signal; (c) realization
of white Gaussian noise; (d) superposition of waveforms shown in (a), (b) and (c).

Figure 5-6: Plots of signal separation results using Wiener smoother: (a) estimate of
non-Gaussian signal; (b) estimate of discrete-time telegraph signal; (c) estimate of white
Gaussian noise; (d) superposition of waveforms shown in (a), (b) and (c).

Figure 5-7: Plots of signal separation results using HMM-based smoother: (a) estimate of
non-Gaussian signal; (b) estimate of discrete-time telegraph signal; (c) estimate of white
Gaussian noise; (d) superposition of waveforms shown in (a), (b) and (c).
ESTIMATION OF {Xt} GIVEN {Zt}

    Type of Estimator    Sample Mean of MSE Value    Sample St. Dev. of MSE Value
    LINEAR               51.6                        9.8
    HMM-BASED            18.1                        5.9

ESTIMATION OF {Yt} GIVEN {Zt}

    Type of Estimator    Sample Mean of MSE Value    Sample St. Dev. of MSE Value
    LINEAR               71.8                        12.9
    HMM-BASED            11.6                        10.4

Table  5.8:  Results  of  the  1000-trial  experiment  designed  to  compare  the  performance  of  the 
optimal  linear  estimator  and  the  optimal  HMM-based  estimator  in  the  signal  separation 
problem.  Sample  means  and  sample  standard  deviations  of  the  MSE  value  are  shown  for 
both  estimation  techniques  for  the  non-Gaussian  signal  {Xt}  (top)  and  for  the  telegraph 
signal {Yt} (bottom).


three  times  smaller  than  the  value  given  by  the  Wiener  smoother.  When  the  telegraph  signal 
{Yt}  is  being  estimated,  the  HMM-based  smoother  yields  an  MSE  value  that  is  more  than 
six  times  smaller  than  the  value  given  by  the  Wiener  smoother.  Moreover,  in  both  of  these 
cases,  the  standard  deviation  of  the  MSE  value  associated  with  the  HMM-based  method 
is  lower  than  the  standard  deviation  associated  with  the  Wiener  method.  Thus,  we  see 
that  the  HMM-based  smoother  performs  significantly  better  than  the  conventional  Wiener 
smoother  for  the  particular  non-Gaussian  signal  separation  problem  considered  here. 
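
For reference, the ratios quoted above follow directly from the sample means listed in Table 5.8:

```latex
\[
\frac{51.6}{18.1} \approx 2.85 \quad \text{(non-Gaussian signal)},
\qquad
\frac{71.8}{11.6} \approx 6.19 \quad \text{(telegraph signal)} .
\]
```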


5.6  Discussion 

5.6.1  Using  Finite-State  Models  for  Problems  Other  than  Smoothing 

Although  we  have  focused  exclusively  in  this  chapter  on  the  problem  of  signal  smoothing, 
it  should  be  clear  that  our  fundamental  approach  of  finite-state  signal  modeling  can  be 
applied  to  a  variety  of  other  signal  processing  problems  as  well.  Specifically,  some  of  the 
most  obvious  and  immediate  applications  of  our  basic  modeling  paradigm  would  include 
other  variations  on  the  signal  estimation  problem,  e.g.,  variations  such  as  signal  filtering  and 
signal  prediction.  In  this  section,  we  draw  heavily  from  the  concepts  developed  earlier  for  the 
smoothing  problem  in  order  to  explain  how  these  alternative  signal  estimation  problems  can 
also  be  solved.  We  then  turn  our  attention  to  an  entirely  different  class  of  signal  processing 
problems, namely problems in signal detection and classification (i.e., M-ary hypothesis
testing), and show that approximate solutions to these problems can also be constructed
using  the  finite-state  approach. 
5.6.1.1 Extensions to Filtering and Prediction Problems

We begin by considering the problems of signal filtering and prediction. The filtering problem
differs fundamentally from the smoothing problem in that we are allowed to use only
those  observations  that  are  available  at  or  before  time  t  to  produce  an  estimate  of  the  true 
signal  value  at  time  t,  i.e.,  we  are  prohibited  from  introducing  a  delay  in  order  to  use  any 
additional  observations  that  become  available  in  the  future.  On  the  other  hand,  in  the 
prediction  problem,  we  use  observations  that  are  available  at  or  before  the  current  time  t 
to  estimate  the  true  signal  value  at  some  future  time  t  +  k. 

In  either  type  of  problem,  HMM-based  estimation  techniques  can  still  be  employed 
successfully.  Whenever  such  techniques  are  used,  whether  for  filtering  or  prediction,  the  first 
and  most  important  step  to  be  performed  is  the  calculation  of  the  posterior  pmf  —  based 
on  all  data  observed  up  to  time  t  —  for  the  state  variable  of  the  underlying  Markov  chain 
at  time  t.  This  critical  first  step  can  be  carried  out  by  using  a  simple  recursive  procedure, 
whereby,  at  each  time  index,  the  previously  computed  values  of  the  posterior  state  pmf 
become  updated  at  the  moment  the  current  sample  in  the  observed  sequence  becomes 
available.  It  is  straightforward  to  show  that  a  recursive  procedure  for  accomplishing  this 
task can be constructed directly from the forward recursion described in Section F.1. In
fact, we note that the desired posterior probabilities in this case can actually be expressed
in terms of the forward variable αt(i) defined earlier, as shown by


    \Pr\{\Theta_t = i \mid Z_{0:t} = z_{0:t}\}
        = \frac{f_{Z_{0:t},\Theta_t}(z_{0:t}, \Theta_t = i)}{f_{Z_{0:t}}(z_{0:t})}    (5.62)

        = \frac{f_{Z_{0:t},\Theta_t}(z_{0:t}, \Theta_t = i)}{\sum_{j=1}^{L} f_{Z_{0:t},\Theta_t}(z_{0:t}, \Theta_t = j)}    (5.63)

        = \frac{\alpha_t(i)}{\sum_{j=1}^{L} \alpha_t(j)}.    (5.64)

Once the current posterior pmf has been computed, the remaining steps in generating
an estimate differ according to whether filtering or prediction is being performed. The steps
involved in prediction are quite simple. In this case, we merely need to project the posterior
state pmf for the current time t out to the future time t + k, so that it actually represents the
state pmf at time t + k based on all data observed up to time t. This projection is achieved
by multiplying the current posterior state pmf (taken to be a row vector) by the kth-order
state transition matrix (i.e., the matrix which is constructed by multiplying the ordinary
state transition matrix by itself k times, and whose (i,j) entry represents the probability
that the Markov chain will be in state j at time t + k given that it is in state i at time
t). Once this projection has been accomplished, the optimal prediction of the signal value
at time t + k is simply a weighted average of the mean values associated with the output
densities of the HMM; the weighting coefficients used in this average are just the elements
of the (projected) posterior state pmf at time t + k.

In the filtering problem, the remaining steps in generating a signal estimate are carried
out exactly as they were in the smoothing problem. In particular, for each state of the
Markov chain, the posterior sub-state probabilities at time t are first computed based on
the value of the observation at time t. (Recall that each sub-state probability indicates
the  relative  likelihood  that  a  particular  component  of  the  Gaussian-mixture  pdf  associated 
with  that  state  was  active  at  time  t,  given  that  the  Markov  chain  was  actually  in  that 
state  at  time  t.)  Once  these  probabilities  have  been  computed,  a  conditional  mean  value 
can  be  obtained  for  a  given  state  of  the  Markov  chain  by  taking  a  weighted  average  of 
the  mean  values  associated  with  the  conditional  densities  in  the  Gaussian  mixture  for  that 
state;  the  weighting  coefficients  used  in  this  average  are  simply  the  posterior  sub-state 
probabilities.  Finally,  the  optimal  estimate  of  the  signal  value  at  time  t  is  a  weighted 
average  of  these  resulting  conditional  mean  values  associated  with  the  states  of  the  Markov 
chain;  the  weighting  coefficients  used  in  this  final  average  are  just  the  elements  of  the 
posterior  state  pmf  at  time  t. 
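
The filtering and prediction steps described in this subsection can be sketched compactly for a simplified case. The code below assumes that every state has a one-component Gaussian output density (as in Table 5.6), so that the sub-state step is trivial, and that the observation is the signal plus independent white Gaussian noise of known standard deviation. The function and argument names are hypothetical; the posterior pmf is obtained by renormalizing at each step, which gives the same ratio as (5.64).

```python
import math

def gaussian_pdf(x, mean, std):
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def filter_and_predict(z, init, trans, means, stds, noise_std, k):
    """Filtered estimates, final posterior state pmf, and k-step prediction (sketch)."""
    if not z:
        raise ValueError("at least one observation is required")
    n_states = len(init)
    # Density of the noisy observation under each state: variances add.
    obs_stds = [math.sqrt(s * s + noise_std ** 2) for s in stds]
    gains = [s * s / (s * s + noise_std ** 2) for s in stds]
    prior = list(init)
    filtered = []
    for zt in z:
        unnorm = [prior[i] * gaussian_pdf(zt, means[i], obs_stds[i])
                  for i in range(n_states)]
        total = sum(unnorm)
        post = [u / total for u in unnorm]  # posterior state pmf given z[0..t]
        # Weighted average of the per-state conditional means.
        filtered.append(sum(post[i] * (means[i] + gains[i] * (zt - means[i]))
                            for i in range(n_states)))
        # One-step projection of the pmf through the transition matrix.
        prior = [sum(post[j] * trans[j][i] for j in range(n_states))
                 for i in range(n_states)]
    pmf = post
    for _ in range(k):  # k-step projection (row vector times transition matrix)
        pmf = [sum(pmf[j] * trans[j][i] for j in range(n_states))
               for i in range(n_states)]
    prediction = sum(p * m for p, m in zip(pmf, means))
    return filtered, post, prediction
```

With Gaussian-mixture output densities having more than one component, the per-state conditional mean would itself be a posterior-weighted average over the components of that state's mixture.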


5.6.1.2 Extension to Multi-Class Hypothesis Testing

To demonstrate that certain signal processing problems other than signal smoothing, filtering,
or prediction can also be addressed using the finite-state modeling paradigm, let us now
turn our attention to the problem of binary (or, in the more general case, M-ary) hypothesis
testing. In the general version of this problem, the waveform we observe is known to be a
realization from one of M distinct signal distributions or classes. Our goal in processing the
observation is to determine the true class from which it came, based on our prior knowledge
about each of the signal classes involved. A simple, classical example of M-ary hypothesis
testing  is  the  signal  detection  problem,  in  which  we  have  only  a  signal-plus-noise  class  and 
a  noise-only  class,  and  we  wish  to  determine  whether  the  signal  of  interest  is  present  or 
absent  in  the  given  observation. 

It  is  well  known  that  the  optimal  decision  rule  for  the  M-ary  hypothesis  testing  problem 
—  in  the  sense  that  it  yields  the  smallest  probability  of  making  a  classification  error  — 
is  the  maximum  a  posteriori  (MAP)  rule  [15,  166,  215].  To  express  this  rule  precisely,  let 
us denote the N-sample observed signal by Z0:N-1, the given realization of this signal by
z0:N-1, the M classes themselves by {1, 2, ..., M}, and the hypothesis that class k is the
true class of the observed signal by Hk. In addition, let k̂ represent our estimate of the true
class. Then, by the MAP rule, this estimate k̂ is defined as

    \hat{k} = \arg\max_{k \in \{1,\dots,M\}} \log \Pr\{H_k \mid Z_{0:N-1} = z_{0:N-1}\}.    (5.65)


That is, class k̂ is, among all M classes, the one whose posterior probability is largest after
we have observed the event Z0:N-1 = z0:N-1. The logarithm in the above expression (which
is a monotonically increasing function and therefore has no effect on the argument of the
maximization) has been introduced only to simplify later calculations.

We  will  find  it  convenient  to  re-express  the  above  posterior  probability  (through  an 
application  of  Bayes’  rule)  as 


    \Pr\{H_k \mid Z_{0:N-1} = z_{0:N-1}\}
        = \frac{\Pr\{H_k\} \, f_{Z_{0:N-1}|H_k}(z_{0:N-1} \mid H_k)}{f_{Z_{0:N-1}}(z_{0:N-1})},    (5.66)

where Pr{Hk} is the prior probability of the event Hk, f_{Z_{0:N-1}|H_k}(·) is the conditional
density of Z0:N-1 given that the event Hk has occurred, and f_{Z_{0:N-1}}(·) is the unconditional
density of Z0:N-1. Because the denominator in the above expression is a positive constant
independent of k, we can ignore it when performing the maximization over all classes. With
this modification, we can then express k̂ equivalently as

    \hat{k} = \arg\max_{k} \{ \log \Pr\{H_k\} + \log f_{Z_{0:N-1}|H_k}(z_{0:N-1} \mid H_k) \}.    (5.67)

Since each prior probability Pr{Hk} is known, the only remaining quantity that must be
computed is the value of the conditional density f_{Z_{0:N-1}|H_k}(z_{0:N-1}|H_k) for each k.

We now invoke the assumption that the observed signal Z0:N-1 is (either exactly or
approximately) the output of a known, unique finite-state HMM under each of the M
hypotheses. With this assumption, we can use algorithms derived earlier to compute the
associated forward recursion variable α_{k,t}(·) for each hypothesis Hk and for each time index
t. This forward variable is important because it can be manipulated to give the density
value we need, as shown by

    \log f_{Z_{0:N-1}|H_k}(z_{0:N-1} \mid H_k)
        = \log \sum_{j=1}^{L} f_{Z_{0:N-1},\Theta_{N-1}|H_k}(z_{0:N-1}, \Theta_{N-1} = j \mid H_k)    (5.68)

        = \log \sum_{j=1}^{L} \alpha_{k,N-1}(j).    (5.69)

As  we  discuss  in  Appendix  F,  however,  the  forward  recursion  variable  cannot  be  evaluated 
directly  on  a  computer  (unless  N  is  very  small),  because  the  arithmetic  operations  required 
to  evaluate  it  at  each  time  index  will  eventually  exceed  the  dynamic  range  of  essentially  any 
machine  without  the  use  of  a  special  scaling  procedure.  Ironically,  it  is  also  demonstrated 
in  Appendix  F  that,  as  a  by-product  of  the  properly  conditioned  version  of  the  forward 
recursion,  we  obtain  a  set  of  scaling  coefficients  that  are  in  fact  the  key  to  evaluating  the 
remaining  terms  needed  in  (5.67). 

If we denote the set of scaling coefficients associated with hypothesis Hk by {c_{k,t}},
t = 0, 1, ..., N - 1, then, based on arguments put forth in Appendix F, we see that the
desired value of the conditional density of the observation can be computed in terms of
these scaling coefficients as

    \log f_{Z_{0:N-1}|H_k}(z_{0:N-1} \mid H_k) = \sum_{t=0}^{N-1} \log c_{k,t},    k = 1, 2, \dots, M.    (5.70)

With the ability to evaluate these remaining M terms, we can now easily solve the maximization
in (5.67), and thus make efficient use of the MAP rule under the finite-state modeling
paradigm. In many cases, using this technique could allow us to attain near-optimal M-ary
signal  classification  performance,  provided  that  we  use  signal  models  of  sufficiently  high 
fidelity. 
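
The classification procedure of this subsection can be sketched as follows. The code assumes one-component Gaussian output densities per state and takes the scaling coefficient at each time index to be the normalizing sum of the forward recursion, so that the log-likelihood is the sum of the logarithms of these coefficients as in (5.70); the scaling convention of Appendix F may differ in detail. The two example models (a telegraph signal in white Gaussian noise, and noise alone) and all names are illustrative assumptions.

```python
import math

def gaussian_pdf(x, mean, std):
    u = (x - mean) / std
    return math.exp(-0.5 * u * u) / (std * math.sqrt(2.0 * math.pi))

def hmm_log_likelihood(z, init, trans, means, obs_stds):
    """Log-likelihood of z under an HMM via a scaled forward recursion (sketch)."""
    n_states = len(init)
    pred = list(init)
    log_lik = 0.0
    for zt in z:
        unnorm = [pred[i] * gaussian_pdf(zt, means[i], obs_stds[i])
                  for i in range(n_states)]
        c_t = sum(unnorm)           # scaling coefficient (assumed convention)
        log_lik += math.log(c_t)    # accumulate in the log domain: no underflow
        post = [u / c_t for u in unnorm]
        pred = [sum(post[j] * trans[j][i] for j in range(n_states))
                for i in range(n_states)]
    return log_lik

def map_classify(z, models, priors):
    """Return (class index in 1..M, list of scores) under the MAP rule."""
    scores = [math.log(p) + hmm_log_likelihood(z, *m)
              for m, p in zip(models, priors)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1, scores

NOISE_STD = 5.0
# Class 1: telegraph signal plus noise; class 2: noise only (illustrative models).
SIGNAL_PLUS_NOISE = ([0.5, 0.5], [[0.97, 0.03], [0.03, 0.97]], [-20.0, 20.0],
                     [math.sqrt(0.1 ** 2 + NOISE_STD ** 2)] * 2)
NOISE_ONLY = ([1.0], [[1.0]], [0.0], [NOISE_STD])
```

A call such as `map_classify(z, [SIGNAL_PLUS_NOISE, NOISE_ONLY], [0.5, 0.5])` then implements a simple detector for the telegraph signal; the log-domain accumulation remains usable for record lengths at which a direct product of densities would underflow.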
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5.6.2  HMM-Based  Signal  Estimation  in  Non-Stationary  Noise 

A  potentially  useful  avenue  for  future  research  would  be  to  explore  the  design  of  HMM- 
based  signal  estimation  schemes  for  environments  in  which  the  noise  is  non-stationary.  To 
develop  this  idea  further,  let  us  return  briefly  to  the  Gaussian  estimation  problem  considered 
in  Section  5.4  and  make  some  additional  observations  concerning  the  results  plotted  in 
Figure  5-3  (b).  In  particular,  observe  from  this  figure  that  for  the  two  smallest  tested  values 
of SNR_in, the performance curves become relatively flat after only three states have been
included  in  the  signal  model.  This  implies,  at  least  for  the  Gaussian  estimation  problem 
considered  in  our  example,  that  using  an  HMM  containing  any  more  than  three  states  is 
wasteful  in  an  environment  where  the  SNR  is  low. 

One  way  in  which  we  might  apply  a  conclusion  of  this  kind  is  to  incorporate  it  into  an 
SNR-dependent  metric  for  choosing  the  appropriate  HMM  order  as  part  of  the  model  design 
process.  Alternatively,  we  could  use  it  to  develop  an  estimator  that  is  capable  of  filtering  out 
a  white  Gaussian  noise  process  whose  power  level  is  changing  over  time.  In  this  generalized 
version  of  the  original  estimation  problem,  we  might  not  always  insist  on  having  the  best 
available  signal  estimate;  instead,  we  may  prefer  to  have  a  reasonably  good  estimate  which 
can  be  produced  at  a  modest  computational  cost.  The  associated  estimation  algorithm 
would  require  access  to  multiple  HMM-based  representations  for  the  signal,  each  having  a 
unique  number  of  states  and  hence  a  unique  degree  of  fidelity.  This  estimator  would  use  a 
predetermined  rule  for  optimally  trading  off  computation  for  performance  as  a  function  of 
the  current  SNR.  Future  work  could  be  aimed  at  developing  the  overall  estimation  algorithm 
for  this  more  complex  situation.  One  must  determine,  for  example,  how  to  estimate  the 
current  noise  level,  how  to  decide  whether  to  switch  from  the  current  signal  model  to  a 
different  model  (as  well  as  how  often  this  switching  decision  must  be  made),  and  how  to 
select  the  best  model  among  all  available  models. 
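
One possible form of the predetermined switching rule is sketched below in Python. It is a hypothetical illustration rather than a proposed design: it assumes that performance curves of the kind plotted in Figure 5-3(b) have been tabulated offline, that the performance metric is expressed in dB with larger values being better, and that an estimate of the current input SNR is already available. The name `choose_model_order`, the table layout, and the tolerance parameter are all assumptions.

```python
def choose_model_order(snr_db, performance_table, tolerance_db=0.5):
    """Pick the smallest HMM order that is nearly as good as the best one.

    performance_table maps a tabulated input SNR (dB) to a dictionary
    {number_of_states: output_performance_dB} measured offline.
    """
    # Use the tabulated SNR closest to the current SNR estimate.
    nearest_snr = min(performance_table, key=lambda s: abs(s - snr_db))
    curve = performance_table[nearest_snr]
    best = max(curve.values())
    # Keep every order whose performance is within the tolerance of the best.
    adequate = [order for order, perf in curve.items() if best - perf <= tolerance_db]
    return min(adequate)
```

The tolerance controls the trade-off: a larger value accepts a coarser model, and hence less computation, in exchange for a larger loss relative to the best available model.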


Chapter  6 


Summary  and  Future  Directions 

6.1  Synopsis 


The central goal of this thesis has been to develop a new statistical framework for analyzing
and processing stationary non-Gaussian signals. A unifying theme of the concepts
presented  in  the  thesis  has  been  our  consideration  of  two  fundamental  inference  problems 
that  often  arise  in  practical  situations,  namely  (i)  source  identification  (i.e.,  estimation  of 
the  parameters  of  a  signal  source  based  on  a  mathematical  model  of  the  source  and  an 
uncorrupted  observation  of  the  source  output);  and  (ii)  signal  estimation  (i.e.,  recovery  of 
the  signal  values  themselves  based  on  complete  parametric  knowledge  of  the  measurement 
model  and  a  noisy  observation  of  the  signal).  The  results  following  from  our  analyses  of 
these  two  inference  problems  constituted  the  technical  core  of  the  thesis. 

The  main  body  of  technical  material,  which  was  presented  in  Chapters  2  through  5, 
was  logically  divided  into  two  parts,  according  to  the  type  of  mathematical  model  that  was 
used  to  define  the  structure  of  the  source  signal.  The  first  part,  which  consisted  solely 
of  Chapter  2,  dealt  with  the  two  basic  inference  problems  under  the  assumption  that  the 
signal  was  produced  by  an  ARGMIX  source  (i.e.,  an  autoregressive  LTI  system  driven  by 
i.i.d.  Gaussian-mixture  noise).  The  second  part,  which  consisted  of  Chapters  3,  4,  and  5, 
developed  an  entirely  new  signal  model  in  order  to  overcome  some  of  the  computational 
difficulties  imposed  by  the  ARGMIX  assumption.  In  this  part  of  the  thesis  we  developed  the 
notion  that  a  stationary  non-Gaussian  signal  could  be  approximated  by  a  finite-state  hidden 
Markov  model  (HMM);  we  then  showed  how  such  an  approximation  could  be  manipulated 
to  produce  accurate  and  efficient  solutions  to  the  two  basic  inference  problems. 

Through  our  investigation  of  these  alternative  signal  models,  we  developed  a  new  set 
of  concepts  and  techniques  for  dealing  with  non-Gaussian  problems,  and  we  encountered  a 
number  of  potentially  rich  areas  for  further  exploration.  In  the  remainder  of  this  chapter, 
we  provide  a  summary  of  the  main  contributions  of  the  thesis  and  suggest  several  topics  for 
future  investigation.  A  number  of  open  issues  and  potential  research  topics  have  already 
been  identified  in  the  individual  chapter  discussions  and  therefore  will  not  be  repeated  here. 

6.2  Summary  of  Thesis  Contributions

6.2.1  Development  of  ARGMIX  Parameter  Identification  Algorithm 

We  developed  a  new  iterative  technique  for  identifying  the  parameters  of  an  ARGMIX 
process  based  on  a  finite-length  realization  of  such  a  process.  This  technique,  which  we  refer 
to  as  the  EMAX  algorithm,  was  derived  using  the  generalized  expectation-maximization 
principle.  The  strength  of  the  EMAX  algorithm  lies  in  its  ability  to  identify  both  the  shape 
of the driving-noise pdf and the LTI system that gave rise to the signal. We demonstrated in
several  numerical  examples  that  the  estimation  performance  of  the  EMAX  algorithm  was 
superior  not  only  to  that  of  traditional  least-squares  techniques,  but  also  to  that  of  other 
existing  algorithms  that  are  also  based  on  the  ARGMIX  signal  model.  We  also  developed 
an  alternative  form  of  the  EMAX  algorithm  to  solve  a  restricted  version  of  the  original 
ARGMIX  source  identification  problem.  The  assumptions  of  this  restricted  problem  were 
more  closely  matched  to  those  of  the  classical  AR  Gaussian  source  identification  problem, 
in  that  the  basic  shape  of  the  driving-noise  pdf  was  assumed  known  but  the  scale  of  the  pdf 
was  assumed  unknown.  Although  this  alternative  version  of  the  EMAX  algorithm  required 
more  a  priori  signal  information  than  the  original,  it  had  better  convergence  properties 
because  it  was  designed  to  operate  on  a  likelihood  function  that  had  no  singularities. 

6.2.2  Formulation  of  HMM-Based  Signal  Approximation  Concept 

We  formulated  and  developed  the  novel  concept  that  an  arbitrary  stationary  AR  signal 
could  be  approximated  as  the  output  of  a  finite-state  hidden  Markov  model.  The  HMM- 
based  signal  model  was  introduced  as  an  alternative  to  the  ARGMIX  model  to  reduce  the 
computational  burden  incurred  under  the  ARGMIX  assumption;  a  reduction  in  computation 
was  deemed  possible  because  an  HMM  has  a  simple  probabilistic  structure  that  can  be 
specified  using  only  a  small  number  of  parameters.  To  develop  this  new  model,  we  considered 
the  optimization  problem  in  which  exact  knowledge  of  the  true  signal  pdf  is  given  and  the 
best  HMM-based  approximation  to  this  pdf  is  to  be  found.  The  optimization  was  carried  out 
under  the  constraint  that  the  states  of  the  underlying  Markov  chain  represent  a  collection 
of  disjoint  regions  making  up  a  partition  of  the  original  state  space.  Using  the  Kullback- 
Leibler  distance  as  our  figure  of  merit,  we  first  derived  optimal  parameter  values  for  the 
approximating  HMM  directly  in  terms  of  the  true  signal  pdf,  under  the  assumption  that  the 
state-space  partition  was  fixed.  We  then  showed  that  the  best  partition  was  the  one  that 
maximized  the  mutual  information  between  state  values  of  the  underlying  Markov  chain  at 
successive  time  steps.  Although  most  of  our  initial  analysis  assumed  that  the  true  signal  was 
a  first-order  AR  process,  we  also  showed  that  the  same  basic  results  applied  to  higher-order 
AR  processes. 
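
The mutual information criterion mentioned above can be evaluated directly from the joint state probability R(i,j) = Pr{Θ_{t-1} = i, Θ_t = j} defined in Appendix A. The Python sketch below computes the standard mutual information between successive states, in nats; the function name is hypothetical, and the sketch makes no attempt to search over partitions.

```python
import math


def successive_state_mutual_information(joint):
    """Mutual information (nats) between successive states.

    joint[i][j] is the joint probability of state i at time t-1 and
    state j at time t; the entries are assumed to sum to one.
    """
    row = [sum(r) for r in joint]                      # marginal at time t-1
    col = [sum(joint[i][j] for i in range(len(joint)))
           for j in range(len(joint[0]))]              # marginal at time t
    info = 0.0
    for i, r in enumerate(joint):
        for j, p in enumerate(r):
            if p > 0.0:  # terms with zero probability contribute nothing
                info += p * math.log(p / (row[i] * col[j]))
    return info
```

Under this measure, a partition whose successive states are independent scores zero, while a partition whose current state fully determines the next one scores the entropy of the state distribution.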

6.2.3  Development  of  HMM  Parameter  Identification  Algorithm 

We  constructed  a  practical  iterative  algorithm  for  estimating  the  parameters  of  an  optimal 
HMM-based approximation of a stationary AR process based only on a finite-length realization
of such a process, rather than on a complete description of its pdf. This algorithm can
thus be viewed as an approximate solution to the general AR source identification problem.
The  algorithm  was  configured  to  select  a  feasible  initial  partition  of  the  original  state  space, 
then  to  iteratively  adjust  the  region  boundaries  of  the  partition  until  the  optimal  partition 
is  reached,  and  finally  to  compute  the  HMM  parameter  estimates  based  on  the  distribution 
of  data  points  among  the  resulting  regions.  The  basic  ideas  that  guided  the  iterative  search 
portion  of  the  algorithm  were  based  on  the  theoretical  results  derived  in  the  HMM-based 
signal  approximation  problem.  Although  several  techniques  have  been  developed  by  other 
researchers for estimating the parameters of an HMM, these techniques are typically designed
to optimize a likelihood-based criterion rather than a mutual information criterion;
moreover,  they  are  not  equipped  to  handle  the  state-space  partitioning  constraint. 
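
The final step of the algorithm, computing parameter estimates from the distribution of data points among the regions, can be illustrated with a simplified Python sketch. It assumes a first-order scalar process, so that the state space is the real line and the partition is a set of intervals defined by sorted boundary points, and it uses plain relative frequencies. The iterative boundary adjustment is not shown, and the function name is hypothetical.

```python
from bisect import bisect_right


def estimate_chain_from_partition(samples, boundaries):
    """Relative-frequency estimates of R(i,j) and Q(i,j) for a fixed partition.

    samples    : scalar realization of the process (at least two samples)
    boundaries : sorted interior boundary points of the partition
    Returns (joint, trans), estimates of the joint and transition probabilities.
    """
    num_regions = len(boundaries) + 1
    # Region index of a sample = number of boundaries at or below it.
    regions = [bisect_right(boundaries, x) for x in samples]
    counts = [[0] * num_regions for _ in range(num_regions)]
    for prev, cur in zip(regions, regions[1:]):
        counts[prev][cur] += 1
    total = len(regions) - 1
    joint = [[c / total for c in row] for row in counts]
    trans = []
    for row in counts:
        visits = sum(row)
        # A region never visited gets an all-zero row instead of a division by zero.
        trans.append([c / visits if visits else 0.0 for c in row])
    return joint, trans
```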

6.2.4  Development  of  HMM-Based  Signal  Estimation  Techniques 

We  developed  a  collection  of  techniques  for  performing  MMSE  signal  estimation  based  on  the 
assumption  that  both  the  signal  and  noise  processes  are  outputs  of  finite-state  HMMs.  These 
techniques  also  relied  heavily  on  the  assumption  that  the  pdf  associated  with  any  state  of 
either  HMM  is  a  Gaussian  mixture.  We  began  our  development  by  constructing  a  smoothing 
algorithm  for  the  simple  case  in  which  the  signal  and  noise  were,  respectively,  colored  and 
white  Gaussian  processes.  For  this  case,  we  also  evaluated  estimation  performance  as  a 
function  of  the  order  of  the  HMM-based  approximation  and  compared  the  results  to  those 
of  the  globally  optimal  Wiener  smoother;  we  found  that  near-optimal  performance  could 
be  achieved  when  this  HMM  contained  only  a  small  number  of  states.  We  then  extended 
the  basic  smoothing  algorithm  so  that  it  applied  to  the  case  in  which  both  the  signal  and 
noise  were  allowed  to  be  colored  non-Gaussian  processes.  In  addition,  we  indicated  how 
similar  algorithms  could  be  developed  for  the  problems  of  filtering  and  prediction.  These 
HMM-based  estimation  algorithms  are  quite  general  and  powerful  signal  processing  tools; 
they  consume  only  a  modest  amount  of  computation  and  can  be  applied  in  an  extremely 
broad  range  of  non-Gaussian  problems. 
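
The role played by the Gaussian-mixture assumption can be seen in the simplest memoryless case. The Python sketch below computes the MMSE estimate of a single Gaussian-mixture sample observed in independent zero-mean Gaussian noise, using the standard closed form: each component contributes its own linear (Wiener-type) estimate, weighted by the posterior probability of that component. This is an illustration of the building block only, not the smoothing algorithm of Chapter 5, which must also account for the Markov state dynamics; the function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Gaussian density with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def gmix_mmse_estimate(z, weights, means, stds, noise_std):
    """E{Y | Z = z} for Z = Y + V, with Y Gaussian-mixture and V ~ N(0, noise_std)."""
    posteriors = []
    cond_means = []
    for w, m, s in zip(weights, means, stds):
        total_var = s * s + noise_std * noise_std
        # Unnormalized posterior probability that component (m, s) produced Y.
        posteriors.append(w * gaussian_pdf(z, m, math.sqrt(total_var)))
        # Conditional mean of Y given z and given this component.
        cond_means.append(m + (s * s / total_var) * (z - m))
    norm = sum(posteriors)
    return sum(p * c for p, c in zip(posteriors, cond_means)) / norm
```

With a single component the estimator reduces to the familiar linear shrinkage of the observation toward the mean.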

6.3  Directions  for  Future  Research 

6.3.1  Consideration  of  More  Realistic  Inference  Problems 

Clearly, the problems of source identification and signal estimation, as they have been
defined  in  the  thesis,  are  idealized  versions  of  more  complicated  problems  encountered  in 
practice.  There  are  many  practical  situations,  for  example,  in  which  we  would  like  to  identify 
the  parameters  of  a  signal,  but  we  have  only  noisy  observations  of  the  signal  available 
to  carry  out  the  identification  procedure.  On  the  other  hand,  there  are  also  situations 
in  which  we  would  like  to  estimate  a  signal  in  additive  noise,  but  we  have  only  partial 
knowledge  of  the  parametric  measurement  model.  Both  types  of  situations  call  for  the 
solution  of  a  joint  inference  problem  involving  aspects  of  both  source  identification  and 
signal  estimation.  Consideration  of  such  a  problem  was  well  beyond  the  scope  of  the  thesis, 
mainly  because  it  was  too  complex  to  make  a  convenient  starting  point  for  the  development 
of  an  inference  framework.  Now  that  we  have  made  some  initial  progress  on  the  idealized 
inference problems, however, it appears that a logical direction for further investigation is
to  address  more  realistic  versions  of  these  problems  using  either  of  the  two  signal  models 
we  have  introduced. 

6.3.2  Streamlining  of  HMM-Based  Signal  Estimation  Algorithm 

An  area  in  which  future  research  will  almost  certainly  be  beneficial  is  the  streamlining  of 
computation  in  the  basic  HMM-based  signal  estimation  algorithm  described  in  Chapter  5. 
In  several  examples  presented  throughout  the  thesis,  we  have  observed  that  an  HMM-based 
approximation  of  a  random  process  can  typically  be  described  using  only  a  small  number  of 
non-zero  state  transition  probabilities.  That  is,  upon  constructing  an  approximation  with 
the  method  described  in  Chapter  4,  we  have  found  that  a  transition  of  the  original  state 
vector  from,  say,  time  t  to  time  t  +  1  often  begins  and  ends  in  adjacent  regions  (or  in  the 
same  region)  within  the  optimal  state-space  partition.  Moreover,  state-vector  transitions 
between  regions  that  are  far  apart  tend  to  occur  with  either  negligible  or  zero  probability. 
However,  since  this  special  attribute  of  a  typical  state  trajectory  has  not  been  exploited 
in  the  basic  signal  estimation  algorithm,  it  is  likely  that  the  algorithm  performs  many 
unnecessary  computations  to  produce  the  final  signal  estimate.  With  a  modest  amount  of 
work,  one  could  develop  a  more  sophisticated  algorithm  which  spends  computation  only  to 
deal  with  state  trajectories  that  have  non-negligible  probability.  In  certain  cases  where  the 
state  transition  matrix  of  the  approximating  Markov  chain  is  very  sparse,  it  is  conceivable 
that the computational cost of the algorithm could be reduced from O(L^2) to O(L).
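
The source of the potential savings can be illustrated with a single step of a forward-type recursion. The Python sketch below is a generic illustration, not the estimation algorithm of Chapter 5: the dense step touches all L^2 transition probabilities, whereas the sparse step visits only the stored non-negligible transitions, so its cost is proportional to the number of such transitions (about 3L when transitions occur only between adjacent regions). The function names and the threshold parameter are assumptions.

```python
def forward_step_dense(alpha, trans, lik):
    """One forward step using every entry of the L-by-L transition matrix."""
    num_states = len(alpha)
    return [
        sum(alpha[i] * trans[i][j] for i in range(num_states)) * lik[j]
        for j in range(num_states)
    ]


def to_sparse(trans, threshold=0.0):
    """successors[i] lists (j, probability) for transitions above the threshold."""
    return [[(j, p) for j, p in enumerate(row) if p > threshold] for row in trans]


def forward_step_sparse(alpha, successors, lik):
    """One forward step that visits only the stored transitions."""
    nxt = [0.0] * len(alpha)
    for i, a in enumerate(alpha):
        if a == 0.0:
            continue  # states with zero probability propagate nothing
        for j, p in successors[i]:
            nxt[j] += a * p
    return [n * l for n, l in zip(nxt, lik)]
```

With a threshold of zero the two steps agree exactly up to rounding; a positive threshold discards small transition probabilities and therefore trades a small approximation error for further savings.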

6.3.3  Further  Development  of  HMM-Based  Approximation  Concept 

Although  the  notion  of  representing  random  signals  approximately  using  finite-state  HMMs 
appears  to  hold  enormous  potential,  we  have  taken  only  the  first  steps  toward  exploiting 
this  concept  in  the  thesis.  A  number  of  theoretical  issues  must  still  be  resolved  before  the 
HMM-based  signal  model  can  serve  as  a  basis  for  routine  signal  processor  design.  It  is  easy 
to  imagine  a  complex  signal  processing  or  decision  system  in  which  signals  are  optimally 
represented  as  finite-state  HMMs,  as  we  have  discussed  earlier.  Within  such  a  system,  a 
particular  signal  may  undergo  a  variety  of  known,  well  defined  transformations,  e.g.,  it 
may  pass  through  a  linear  system,  become  corrupted  by  noise,  or  perhaps  be  subjected 
to  a  memoryless  nonlinearity.  For  each  stage  of  processing  within  the  system,  it  would  be 
useful  to  know  precisely  how  to  best  represent  the  output  of  the  transformation  as  an  HMM 
when  we  are  given  an  optimal  representation  of  the  input  as  an  HMM.  Moreover,  in  cases 
where  two  random  processes  become  combined  during  the  transformation  (e.g.,  signal  and 
noise),  we  would  like  to  know  how  to  jointly  design  the  HMM-based  representations  of  both 
processes  so  that  overall  system  performance  is  optimized.  (Recall  that  in  Chapter  5,  we 
merely  combined  the  existing  HMMs  for  signal  and  noise  to  obtain  a  new,  more  complex 
HMM  for  the  observation;  however,  the  two  original  HMMs  had  been  optimized  individually, 
rather  than  jointly.)  If  problems  such  as  these  can  be  solved  through  further  research,  then 
many  functions  carried  out  within  a  complex  signal  processing  system  could  ultimately 
be  cast  in  terms  of  optimal  operations  on  HMMs.  Because  an  HMM  has  a  particularly 
simple stochastic structure, we expect that such operations would be fairly straightforward
to  derive. 

6.3.4  Application  of  HMMs  to  Other  Signal  Processing  Problems 

Our  investigation  of  the  HMM-based  approach  to  non-Gaussian  inference  problems  was 
necessarily limited in scope; specifically, we restricted our attention to the two basic problems
of source identification and signal estimation. Furthermore, for the signal estimation
problem  in  particular,  we  focused  almost  exclusively  on  the  development  of  a  smoothing 
algorithm.  As  we  pointed  out  in  Chapter  5,  however,  filtering  and  prediction  algorithms 
could  be  developed  in  a  similar  manner.  We  also  provided  a  fairly  detailed  outline  indicating 
how  HMMs  could  be  used  to  solve  detection  and  classification  problems  efficiently.  Still, 
there remain many signal processing problems in which the HMM paradigm could be successfully
applied. Thus, another potentially fruitful direction for future work is to develop
HMM-based  solutions  to  problems  such  as  deconvolution  (in  which  the  distorting  system 
may  be  either  known  or  unknown),  joint  detection  and  estimation,  signal  enhancement, 
signal  quantization,  or  compression. 


Appendix  A 


Notational  Conventions  and 
Abbreviations 


The notational conventions and abbreviations used in the thesis are generally introduced
and explained in detail as they are needed. For convenient reference, we summarize some of
the most important symbols and abbreviations below. These definitions should be assumed
to hold unless otherwise stated.

A.1  Abbreviations

AR      →  autoregressive
ARGMIX  →  autoregressive Gaussian-mixture
ASK     →  amplitude-shift keying
cdf     →  cumulative distribution function
EM      →  expectation-maximization
FIR     →  finite impulse response
GEM     →  generalized expectation-maximization
HMM     →  hidden Markov model
HOS     →  higher-order statistics
i.i.d.  →  independent and identically distributed
IIR     →  infinite impulse response
ISI     →  intersymbol interference
LRT     →  likelihood ratio test
LTI     →  linear and time invariant
ML      →  maximum likelihood
MAP     →  maximum a posteriori
MMSE    →  minimum mean squared error
MSE     →  mean squared error
pdf     →  probability density function
pmf     →  probability mass function
psd     →  power spectral density
SER     →  signal-to-error ratio
SNR     →  signal-to-noise ratio


A.2  Notational Conventions

$\mathbb{R}$  →  the set of real numbers
$\mathbb{R}^+$  →  the set of nonnegative real numbers
$\mathbb{R}^n$  →  the set of n-dimensional real-valued tuples
$\mathbb{Z}$  →  the set of integers
$\mathbb{Z}^+$  →  the set of nonnegative integers
$\emptyset$  →  the empty set
$\cup$  →  union operator for sets
$\cap$  →  intersection operator for sets
$\log x$  →  the natural logarithm of x
$\exp x$  →  the exponential $e^x$
$\Pr\{A\}$  →  probability of the event A
$\Pr\{A, B\}$  →  joint probability of the events A and B
$\Pr\{A \mid B\}$  →  probability of the event A conditioned on the event B
$E\{Y\}$  →  expected value of the random variable Y
$E\{Y \mid Z = z\}$  →  expected value of Y conditioned on the event Z = z
$\hat{Y}$  →  estimate of the random variable Y
$\{Y_t\}$  →  discrete-time random process
$\{\tilde{Y}_t\}$  →  approximation of the random process $\{Y_t\}$
$Y_{i:j}$  →  the vector $(Y_i, Y_{i+1}, \ldots, Y_j)$ if $i < j$ or
              the vector $(Y_i, Y_{i-1}, \ldots, Y_j)$ if $i > j$
$f_Y(\cdot)$  →  pdf of random variable Y
$f_Y(\cdot\,; \psi)$  →  pdf of Y which depends on the parameter vector $\psi$
$f_{Y,Z}(\cdot, \cdot)$  →  joint pdf of random variables Y and Z
$f_{Y|Z}(\cdot \mid Z = z)$  →  conditional pdf of Y given the event Z = z
$\mathcal{N}(\cdot\,; \mu, \sigma)$  →  Gaussian pdf with mean $\mu$ and standard deviation $\sigma$
$D(f_X \,\|\, f_Y)$  →  Kullback-Leibler distance between densities $f_X(\cdot)$ and $f_Y(\cdot)$
$I(X, Y)$  →  mutual information between the random variables X and Y
$\max_{x \in V}\{h(x)\}$  →  largest value of the function $h(\cdot)$ on the set V
$\min_{x \in V}\{h(x)\}$  →  smallest value of the function $h(\cdot)$ on the set V
$\arg\max_{x \in V}\{h(x)\}$  →  element in V yielding the largest value of $h(\cdot)$
$\arg\min_{x \in V}\{h(x)\}$  →  element in V yielding the smallest value of $h(\cdot)$


A.3  Context-Specific Symbols

$X_t$  →  state vector of a dynamical system at time t
$Y_t$  →  source signal at time t
$Z_t$  →  observed signal at time t
$W_t$  →  driving noise of a dynamical system at time t
$V_t$  →  additive observation noise at time t
$\psi$  →  parameter vector characterizing a pdf or a dynamical system
$K$  →  order of an autoregressive process
$L$  →  number of states in a Markov chain or an HMM
$M$  →  number of components in a Gaussian-mixture pdf
$N$  →  number of samples in a finite-length observation
[illegible]  →  component of a Gaussian mixture selected at time t
$\Theta_t$  →  state variable for a Markov chain at time t
$P(i)$  →  initial state probability $\Pr\{\Theta_0 = i\}$
$Q(i,j)$  →  state transition probability $\Pr\{\Theta_t = j \mid \Theta_{t-1} = i\}$
$R(i,j)$  →  joint state probability $\Pr\{\Theta_{t-1} = i, \Theta_t = j\}$
$g_i(\cdot)$  →  HMM output density $f_{Y_t|\Theta_t}(\cdot \mid \Theta_t = i)$
$f_i(\cdot)$  →  HMM output density $f_{X_t|\Theta_t}(\cdot \mid \Theta_t = i)$


Appendix  B 

Maximization  of  a  Function 
Related  to  Cross-Entropy 


In this appendix, we derive solutions to two closely related optimization problems that arise
repeatedly throughout the thesis. One of these problems deals with finite-length tuples
whose elements are positive real numbers, and the other deals with positive functions of
a real variable. In the first optimization problem, we are given an M-dimensional tuple
$a = (a_1, a_2, \ldots, a_M)$ whose elements are all positive real numbers, and we seek the tuple
$b^* = (b_1^*, b_2^*, \ldots, b_M^*)$ specified implicitly through the maximization

\[
b^* = \arg\max_{b \in B} \sum_{k=1}^{M} a_k \log b_k, \tag{B.1}
\]

where $B$ represents the set of all tuples $b = (b_1, b_2, \ldots, b_M)$ whose elements are positive
and satisfy the constraint

\[
\sum_{k=1}^{M} b_k = 1. \tag{B.2}
\]

For this discrete case, we will prove that the elements $b_1^*, b_2^*, \ldots, b_M^*$ of the optimal tuple
are given by

\[
b_k^* = \frac{a_k}{\sum_{j=1}^{M} a_j}, \qquad k = 1, 2, \ldots, M. \tag{B.3}
\]

In the second optimization problem, we are given a real-valued continuous function $a(\cdot)$
which is strictly positive on the open interval $(x_1, x_2)$ and has the property that

\[
\int_{x_1}^{x_2} a(x)\, dx < \infty. \tag{B.4}
\]

The function we seek, which we denote by $b^*(\cdot)$, is specified implicitly as the solution to the
maximization problem

\[
b^*(\cdot) = \arg\max_{b(\cdot) \in C} \int_{x_1}^{x_2} a(x) \log b(x)\, dx, \tag{B.5}
\]

where $C$ is the set of all real-valued, continuous, strictly positive functions $b(\cdot)$ defined on
$(x_1, x_2)$ that satisfy the constraint

\[
\int_{x_1}^{x_2} b(x)\, dx = 1. \tag{B.6}
\]

For this continuous case, we will prove that the maximizing function $b^*(\cdot)$ is given by

\[
b^*(x) = \frac{a(x)}{\int_{x_1}^{x_2} a(u)\, du}, \qquad x_1 < x < x_2. \tag{B.7}
\]


Before  proceeding  with  our  proofs,  we  remark  that,  although  both  of  the  assertions  above 
are  made  with  the  assumptions  of  strict  positivity  on  the  variables  involved,  analogous  proofs 
can  easily  be  constructed  when  the  variables  are  taken  to  be  merely  nonnegative. 

Let us denote by $\bar{a} = (\bar{a}_1, \bar{a}_2, \ldots, \bar{a}_M)$ the normalized version of the given tuple $a$, so
that the elements of $\bar{a}$ are defined by

\[
\bar{a}_k = \frac{a_k}{\sum_{j=1}^{M} a_j}, \qquad k = 1, 2, \ldots, M. \tag{B.8}
\]

Clearly, we have that $\bar{a} \in B$. Our strategy in proving that $\bar{a}$ is the unique solution to the
maximization taken in (B.1) will be to show that

\[
\sum_{k=1}^{M} a_k \log b_k \le \sum_{k=1}^{M} a_k \log \bar{a}_k \tag{B.9}
\]

for all $b \in B$, and furthermore that equality holds in this expression if and only if $b = \bar{a}$.

A proof of the above inequality can be developed by using certain key properties of the
function $g(x) = x \log x$, which is defined for $x > 0$. Consider the second-order Taylor series
expansion of $g(\cdot)$ about the point $x_0$, given by

\[
g(x) = g(x_0) + g'(x_0)(x - x_0) + \tfrac{1}{2} g''(x^*)(x - x_0)^2, \tag{B.10}
\]

where $x^*$ is a number lying between $x$ and $x_0$ whose value is dependent on both $x$ and
$x_0$ (although we have not indicated this dependence explicitly) and whose existence is
guaranteed by Taylor's theorem [160]. Observe that because

\[
g'(x) = \frac{d}{dx}\,(x \log x) = 1 + \log x \tag{B.11}
\]

and

\[
g''(x) = \frac{d}{dx}\,(1 + \log x) = \frac{1}{x}, \tag{B.12}
\]

we have that $g''(x) > 0$ whenever $x > 0$. This implies that the final term $\tfrac{1}{2} g''(x^*)(x - x_0)^2$
in the Taylor series expansion given above is always nonnegative, and in fact is zero only in
the case when $x = x_0$. If we now rewrite the Taylor series expansion for the particular value
$x_0 = 1$, making use of the facts that $g(1) = 0$, $g'(1) = 1$, and $g''(x^*) = \epsilon > 0$, we obtain

\[
x \log x = (x - 1) + \tfrac{1}{2} \epsilon (x - 1)^2. \tag{B.13}
\]

From this representation we conclude that

\[
x \log x \ge x - 1 \tag{B.14}
\]

with equality if and only if $x = 1$. This inequality provides a foundation for the proof of
(B.9). In Figure B-1, we show plots of the functions $g(x) = x \log x$ and $h(x) = x - 1$ in the
vicinity of the point $x_0 = 1$.

Figure B-1: Plots of $g(x) = x \log x$ (upper curve) and $h(x) = x - 1$.

If we now replace the positive variable $x$ in (B.14) with the positive ratio $\bar{a}_k / b_k$, we
obtain the expression

\[
\frac{\bar{a}_k}{b_k} \log \frac{\bar{a}_k}{b_k} \ge \frac{\bar{a}_k}{b_k} - 1 \tag{B.15}
\]

or equivalently, after multiplying both sides by $b_k$,

\[
\bar{a}_k \log \frac{\bar{a}_k}{b_k} \ge \bar{a}_k - b_k, \tag{B.16}
\]

which holds with equality if and only if $b_k = \bar{a}_k$. This last expression actually represents
not one but M distinct inequalities, one for each value of the tuple index k. The left-hand
and right-hand sides of these M inequalities may then be separately summed to yield

\begin{align}
\sum_{k=1}^{M} \bar{a}_k \log \frac{\bar{a}_k}{b_k} &\ge \sum_{k=1}^{M} \bar{a}_k - \sum_{k=1}^{M} b_k \tag{B.17} \\
&= 1 - 1 \tag{B.18} \\
&= 0, \tag{B.19}
\end{align}

which holds with equality if and only if all of the constituent inequalities hold with equality,
i.e., if and only if $b_k = \bar{a}_k$ for $k = 1, 2, \ldots, M$. If in (B.19) we write the logarithm of the
ratio as a difference of logarithms, we obtain

\[
\sum_{k=1}^{M} \bar{a}_k \log b_k \le \sum_{k=1}^{M} \bar{a}_k \log \bar{a}_k. \tag{B.20}
\]

Finally, upon multiplying both sides of this expression by the positive quantity $\sum_{j=1}^{M} a_j$,
which removes the normalization from the coefficients $\bar{a}_k$ in each summation, we obtain

\[
\sum_{k=1}^{M} a_k \log b_k \le \sum_{k=1}^{M} a_k \log \bar{a}_k, \tag{B.21}
\]

which  is  what  we  wished  to  show. 
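
The discrete result can be checked numerically. The Python sketch below evaluates the objective of (B.1) at the normalized tuple of (B.3) and at randomly drawn tuples satisfying the constraint (B.2); it is a sanity check under arbitrary test values rather than part of the proof, and the helper names are hypothetical.

```python
import math
import random


def weighted_log_sum(a, b):
    """Objective of (B.1): sum over k of a_k * log(b_k)."""
    return sum(ak * math.log(bk) for ak, bk in zip(a, b))


def normalize(a):
    """Scale a positive tuple so that its elements sum to one, as in (B.3)."""
    total = sum(a)
    return [ak / total for ak in a]


def random_simplex_point(m, rng):
    """A random positive tuple of length m whose elements sum to one."""
    return normalize([rng.random() + 1e-9 for _ in range(m)])


def never_exceeds_optimum(a, trials=1000, seed=0):
    """True if no random feasible b beats the normalized version of a."""
    rng = random.Random(seed)
    best = weighted_log_sum(a, normalize(a))
    return all(
        weighted_log_sum(a, random_simplex_point(len(a), rng)) <= best + 1e-12
        for _ in range(trials)
    )
```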


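To prove the analogous result in the continuous case, we need to revisit the original
Taylor series expansion given in (B.13). In particular, from (B.13) we have

\[
\frac{\bar{a}(x)}{b(x)} \log \frac{\bar{a}(x)}{b(x)} = \left( \frac{\bar{a}(x)}{b(x)} - 1 \right) + \tfrac{1}{2} \epsilon(x) \left( \frac{\bar{a}(x)}{b(x)} - 1 \right)^2, \tag{B.22}
\]

where $\bar{a}(\cdot)$ is the normalized function given by

\[
\bar{a}(x) = \frac{a(x)}{\int_{x_1}^{x_2} a(u)\, du}, \qquad x_1 < x < x_2, \tag{B.23}
\]

and $\epsilon(x)$ is a positive function whose existence is guaranteed by Taylor's theorem. Upon
multiplying both sides of (B.22) by $b(x)$ and then integrating from $x_1$ to $x_2$, we obtain

\begin{align}
\int_{x_1}^{x_2} \bar{a}(x) \log \frac{\bar{a}(x)}{b(x)}\, dx
  &= \int_{x_1}^{x_2} \bar{a}(x)\, dx - \int_{x_1}^{x_2} b(x)\, dx
     + \int_{x_1}^{x_2} \tfrac{1}{2} \epsilon(x) b(x) \left( \frac{\bar{a}(x)}{b(x)} - 1 \right)^2 dx \tag{B.24} \\
  &= \int_{x_1}^{x_2} \tfrac{1}{2} \epsilon(x) b(x) \left( \frac{\bar{a}(x)}{b(x)} - 1 \right)^2 dx \tag{B.25}
\end{align}

or, after a bit of straightforward algebraic rearrangement,

\[
\int_{x_1}^{x_2} a(x) \log \bar{a}(x)\, dx = \int_{x_1}^{x_2} a(x) \log b(x)\, dx
  + A \int_{x_1}^{x_2} \tfrac{1}{2} \epsilon(x) b(x) \left( \frac{\bar{a}(x)}{b(x)} - 1 \right)^2 dx, \tag{B.26}
\]

where $A = \int_{x_1}^{x_2} a(u)\, du$ is the positive normalizing term taken from $\bar{a}(\cdot)$. Note that the
second term on the right-hand side of this last expression is always nonnegative. Moreover,
owing to the continuity of $a(\cdot)$ and $b(\cdot)$, this term is zero if and only if $\bar{a}(x) = b(x)$ for all
$x \in (x_1, x_2)$. This observation gives the desired result that

\[
\int_{x_1}^{x_2} a(x) \log b(x)\, dx \le \int_{x_1}^{x_2} a(x) \log \bar{a}(x)\, dx, \tag{B.27}
\]

with equality if and only if $\bar{a}(x) = b(x)$.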


Appendix  C

Computer  Implementation  of  the
EMAX  Algorithm


The  following  source  code  listing,  which  is  written  in  the  MATLAB  programming  language, 
represents  one  implementation  of  the  EMAX  algorithm  derived  in  Chapter  2. 


function [mu,sig,rho,a] = EMAX(y,mu0,sig0,rho0,a0,tol,n_iter)

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%%  Description of variables
%%  ------------------------
%%
%%  Input arguments:
%%
%%    y ........ vector of observations to be processed
%%    mu0 ...... vector of initial values for means
%%    sig0 ..... vector of initial values for standard deviations
%%    rho0 ..... vector of initial values for weighting coefficients
%%    a0 ....... vector of initial values for autoregressive parameters
%%    tol ...... numerical tolerance for terminating algorithm
%%    n_iter ... number of iterations for coordinate ascent
%%
%%  Output arguments:
%%
%%    mu ....... vector of final estimates for means
%%    sig ...... vector of final estimates for standard deviations
%%    rho ...... vector of final estimates for weighting coefficients
%%    a ........ vector of final estimates for autoregressive parameters
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%%  Compute dimension of autoregression vector, number of components
%%  in Gaussian-mixture pdf, and number of input observations.
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

K = length(a0);
M = length(mu0);
N = length(y);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%%  Construct convolution matrix from input data for efficient
%%  implementation of FIR filtering operation, and modify observation
%%  sequence accordingly.
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

H = toeplitz(y(K:1:N-1),y(K:-1:1));
y = y(K+1:N);
N = length(y);

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%%  Initialize parameter values and create parameter vector psi.
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

a   = a0;
mu  = mu0;
sig = sig0;
rho = rho0;
psi = [a' mu' sig' rho'];

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%
%%  Begin loop to iterate formulas for EMAX algorithm until
%%  convergence is obtained.
%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

err = tol + 1;
while (err > tol),

  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  %%
  %%  Process observations with inverse filter and make array of
  %%  mean-removed residual sequences for all classes.
  %%
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

  z   = y - H*a;
  zmu = z*ones(1,M)-ones(N,1)*mu';

  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  %%
  %%  Compute posterior class probabilities.
  %%
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

  P = ones(N,1)*((rho./sig)') .* ...
      exp(-0.5*(zmu.^2).*(ones(N,1)*(1./sig').^2));
  P = P./(sum(P')'*ones(1,M));

  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  %%
  %%  Update estimates for means, standard deviations, and AR
  %%  parameters via coordinate ascent by iterating update formulas
  %%  n_iter times.
  %%
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

  for k = 1:n_iter,
    z   = y - H*a;
    mu  = ((sum(P.*(z*ones(1,M))))./sum(P))';
    zmu = z*ones(1,M)-ones(N,1)*mu';
    sig = sqrt(diag(zmu'*(P.*zmu))./sum(P)');
    wgt = sum((P.*(ones(N,1)*((1./sig').^2)))')';
    u   = sum(((P.*(y*ones(1,M)-ones(N,1)*mu'))./(ones(N,1)*((sig').^2)))')';
    a   = inv((H'.*(ones(K,1)*wgt'))*H)*H'*u;
  end

  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

77 

7.7.  Update  estimates  for  weighting  coefficients. 

7.7. 

77777777777777777777777777777777777777777777777777777777777777777777777 
rho  =  mean(P) ’ ; 


7.7.7.7.7.7.7.7.7.7.77777.7.7.7.7.7.777.77.7.7777.77.7.7.7.7.77.7.7.77.7.7777.77.7.7.77.7.7.7.77.7.777.7.7.7.7777.7.7 

77 

77  Reassign  parameter  vector  psi  and  compute  Euclidean  distance 
77  between  this  value  and  the  previous  value. 

77 

77777777777777.7.77777777.777777777777777777777777777777777777777777777777 


psi.old  =  psi; 
psi  =  [a’  mu’  sig’  rho’]; 
err  =  norm(psi_old  -  psi) ; 
end 


Appendix  D 


Relationship  of  Kullback-Leibler 
Distance  to  Other  Metrics 


In  this  appendix,  we  demonstrate  that  the  Kullback-Leibler  distance  is  closely  related  to  two 
other  important  statistical  measures,  namely  probability  of  detection  and  log-likelihood.  A 
discussion  of  these  relationships  gives  us  a  better  intuitive  understanding  —  from  the  points 
of  view  of  classical  detection  theory  and  parameter  estimation  —  about  the  distributional 
properties  that  are  actually  measured  by  the  Kullback-Leibler  distance. 


D.1  Connection with Probability of Detection

To describe the nature of the link between Kullback-Leibler distance and probability of
detection, let us first recall our earlier discussion from Section 3.2.1, in which we cast the
problem of quality assessment in terms of a particular type of game played with an independent
observer. In this game, the observer was furnished with descriptions of two densities,
and his goal was to correctly guess which of them gave rise to a given set of realizations.
We, on the other hand, were able to select one of the two densities in advance from a
given class $\mathcal{F}$; our goal was to choose a density that would confuse the observer more often
than would any other density in $\mathcal{F}$. Because we knew that the observer would always use
a statistical test that maximized the probability of a correct decision, we argued that the
optimal density in $\mathcal{F}$ was that which yielded the minimum possible value of this probability.

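Recall that one of the two components making up the probability of a correct decision
is the probability of detection, or $P_D$, which is given by

\begin{align}
P_D &= \Pr\{\text{Declare } H_1 \mid H_1 \text{ true}\} \tag{D.1} \\
    &= \Pr\{\mathcal{L}(Z_{0:N-1}) > 0 \mid H_1 \text{ true}\} \tag{D.2} \\
    &= \int_0^{\infty} f_{\mathcal{L}|H_1}(\mathcal{L} \mid H_1 \text{ true})\, d\mathcal{L}. \tag{D.3}
\end{align}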

This component is a particularly important one to examine because it involves hypothesis
$H_1$, which is the only hypothesis that ever occurs in an actual signal processing situation.
(That is, any observations that must be processed in a realistic scenario are precisely those
that arise from the true source, and not from some approximation to the true source.) It
may be just as reasonable, therefore, to define the quality of our approximation exclusively
in terms of $P_D$ as it was to define it in terms of the probability of a correct decision.

Note that $P_D$ can be viewed as the area under the tail of the conditional density of
$\mathcal{L}(Z_{0:N-1})$ given $H_1$. We can learn more about the specific shape of this distributional tail
by rewriting the discrimination statistic $\mathcal{L}(Z_{0:N-1})$ as

\begin{align}
\mathcal{L}(Z_{0:N-1}) &= \log \frac{f_{Y_{0:N-1}}(Z_{0:N-1})}{\hat{f}_{Y_{0:N-1}}(Z_{0:N-1})} \tag{D.4} \\
 &= \log \frac{\prod_{t=0}^{N-1} f_Y(Z_t)}{\prod_{t=0}^{N-1} \hat{f}_Y(Z_t)} \tag{D.5} \\
 &= \log \prod_{t=0}^{N-1} \frac{f_Y(Z_t)}{\hat{f}_Y(Z_t)} \tag{D.6} \\
 &= \sum_{t=0}^{N-1} \log \frac{f_Y(Z_t)}{\hat{f}_Y(Z_t)} \tag{D.7} \\
 &= \sum_{t=0}^{N-1} I_t, \tag{D.8}
\end{align}

where $f_Y$ denotes the true density, $\hat{f}_Y$ denotes the approximating density, and we have
used the definition

$$I_t = \log \frac{f_Y(Z_t)}{\hat{f}_Y(Z_t)}. \tag{D.9}$$


We can easily see from this re-expression that $\mathcal{L}(Z_{0:N-1})$ is merely the sum of $N$ i.i.d.
random variables $\{I_t\}_{t=0}^{N-1}$ which have been derived from the original set of i.i.d. observations
$\{Z_t\}_{t=0}^{N-1}$. Therefore, if we assume that the number of observations, $N$, is very large,
then we have by the Central Limit Theorem that $\mathcal{L}(Z_{0:N-1})$ behaves approximately like a
Gaussian random variable [42, 55, 57].

Because $\mathcal{L}(Z_{0:N-1})$ is ultimately compared to a threshold of zero in the optimal test, we
now introduce a more convenient, normalized version of this statistic given by

$$\ell(Z_{0:N-1}) = \frac{1}{N}\,\mathcal{L}(Z_{0:N-1}) = \frac{1}{N} \sum_{t=0}^{N-1} I_t. \tag{D.10}$$

Since this normalized statistic will also be approximately Gaussian for large $N$, we write

$$f_{\ell|H_1}(\ell \mid H_1 \text{ true}) \approx \mathcal{N}(\ell;\, \mu, \sigma), \tag{D.11}$$

where the mean and standard deviation parameters, $\mu$ and $\sigma$, are understood to be dependent
on the particular density $\hat{f}_Y(\cdot)$ that is selected from $\mathcal{F}$.

To simplify the remaining exposition, let us now suppose that, as we vary $\hat{f}_Y(\cdot)$ over the
entire set $\mathcal{F}$, the range of values taken by $\sigma$ is extremely small in comparison to the range
of values taken by $\mu$, so that $\sigma$ can be considered essentially constant. Then, if we wish
to confuse the observer by minimizing his probability of detection (i.e., by putting as little
area as possible under the positive tail of $f_{\ell|H_1}(\cdot)$), we need only choose the density in $\mathcal{F}$
associated with the smallest value of $\mu$. But observe that the value of $\mu$ is given by

\begin{align}
\mu &= E\{\ell(Z_{0:N-1}) \mid H_1 \text{ true}\} \tag{D.12} \\
    &= E\left\{ \frac{1}{N} \sum_{t=0}^{N-1} I_t \,\Big|\, H_1 \text{ true} \right\} \tag{D.13} \\
    &= E\{I_0 \mid H_1 \text{ true}\} \tag{D.14} \\
    &= E\left\{ \log \frac{f_Y(Z_0)}{\hat{f}_Y(Z_0)} \,\Big|\, H_1 \text{ true} \right\} \tag{D.15} \\
    &= \int_{-\infty}^{\infty} f_Y(y) \log \frac{f_Y(y)}{\hat{f}_Y(y)}\, dy \tag{D.16} \\
    &= D(f_Y, \hat{f}_Y). \tag{D.17}
\end{align}


In other words, $\mu$ is precisely the Kullback-Leibler distance between the true and approximate
densities. We conclude, therefore, under the assumptions stated above, that as the
number of observations becomes very large, minimizing the Kullback-Leibler distance between
$f_Y$ and $\hat{f}_Y$ is approximately equivalent to minimizing the observer's probability of
detection.
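
The argument above can be checked numerically with a small sketch. The example below is an illustration only, not part of the original development: it assumes discrete pmfs in place of densities, uses hypothetical values for the true and approximating pmfs, and computes $\sigma$ exactly for each candidate rather than treating it as constant. Under those assumptions it evaluates the mean of $I_0$ (the Kullback-Leibler distance) and the Gaussian approximation to $P_D$ for $N$ i.i.d. observations.

```python
import math


def kl_and_std(p, q):
    """Mean and standard deviation of I_0 = log(p(Z)/q(Z)) when Z ~ p."""
    mean = sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)
    second = sum(pj * math.log(pj / qj) ** 2 for pj, qj in zip(p, q) if pj > 0)
    return mean, math.sqrt(max(second - mean ** 2, 0.0))


def approx_pd(p, q, n):
    """Gaussian (CLT) approximation of Pr{normalized statistic > 0 | H1}.

    Assumes p != q, so that the standard deviation is nonzero.
    """
    mu, sigma_i = kl_and_std(p, q)
    sigma = sigma_i / math.sqrt(n)  # std of the average of n i.i.d. terms
    # Area of a N(mu, sigma) density above zero is Phi(mu / sigma).
    return 0.5 * (1.0 + math.erf(mu / (sigma * math.sqrt(2.0))))


# Hypothetical pmfs: one true source and two candidate approximations.
true_pmf = [0.5, 0.3, 0.2]
close_pmf = [0.45, 0.33, 0.22]
far_pmf = [0.2, 0.3, 0.5]

mu_close, _ = kl_and_std(true_pmf, close_pmf)
mu_far, _ = kl_and_std(true_pmf, far_pmf)
pd_close = approx_pd(true_pmf, close_pmf, 50)
pd_far = approx_pd(true_pmf, far_pmf, 50)
```

In this example the candidate with the smaller Kullback-Leibler distance also yields the smaller approximate probability of detection, which is the behavior the argument predicts.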


D.2  Connection  with  Log-Likelihood 

We  now  show  that  there  is  a  direct  connection  between  Kullback-Leibler  distance  and  log- 
likelihood,  in  the  sense  that  minimizing  the  Kullback-Leibler  distance  to  obtain  the  best 
parametric  description  of  a  pdf  is  equivalent  to  maximizing  the  log-likelihood  function  in 
a  closely  related  problem.  Before  describing  the  relationship  between  these  two  measures 
more  precisely,  let  us  first  consider  the  two  optimization  problems  that  link  them.  These 
problems  may  be  stated  as  follows: 

(i)  (Minimization of Kullback-Leibler Distance.) Let $Y$ be a discrete-valued random
variable which is distributed on the finite set $\{1, 2, \ldots, L\}$ according to the pmf $f_Y(\cdot)$.
Assume that the pmf values $\{f_Y(j)\}_{j=1}^{L}$ are given. Let $\mathcal{F}$ be a parameterized set of
pmfs defined by $\mathcal{F} = \{\hat{f}_Y(\cdot\,; \Psi) \mid \Psi \in \mathcal{V}\}$, where $\mathcal{V}$ represents a collection of admissible
parameter values. Determine the set of all parameter values $\Psi_{KL} \in \mathcal{V}$ that satisfy

$$\Psi_{KL} = \arg\min_{\Psi \in \mathcal{V}} \sum_{j=1}^{L} f_Y(j) \log \frac{f_Y(j)}{\hat{f}_Y(j; \Psi)}. \tag{D.18}$$

(ii)  (Maximization of Likelihood Function.) Let $\{Y_t\}_{t=0}^{N-1}$ be a set of i.i.d. discrete-valued
random variables, each distributed on the finite set $\{1, 2, \ldots, L\}$ according to the pmf
$f_Y(\cdot)$, and let $\{y_t\}_{t=0}^{N-1}$ be a corresponding set of realizations of these random variables.
Assume that the pmf values $\{f_Y(j)\}_{j=1}^{L}$ are unknown, but that the parameterized
collection of pmfs $\mathcal{F} = \{\hat{f}_Y(\cdot\,; \Psi) \mid \Psi \in \mathcal{V}\}$ is hypothesized (possibly incorrectly) to
contain $f_Y(\cdot)$. Determine the set of all parameter values $\Psi_{ML} \in \mathcal{V}$ that satisfy

$$\Psi_{ML} = \arg\max_{\Psi \in \mathcal{V}} \prod_{t=0}^{N-1} \hat{f}_Y(y_t; \Psi). \tag{D.19}$$


We will demonstrate that, as $N \to \infty$ in problem (ii), the sets $\{\Psi_{ML}\}$ and $\{\Psi_{KL}\}$ become
identical, i.e., any parameter value that asymptotically maximizes the hypothesized likelihood
function also minimizes the Kullback-Leibler distance between the hypothesized pmf
and the true pmf, and vice versa.

To establish this result, it will be convenient to introduce a function known as the
empirical pmf (also sometimes termed the type [41]), which we denote by $\tilde{f}_Y(\cdot)$. In the
context of problem (ii) above, the empirical pmf is defined in terms of a particular realization
$y_{0:N-1}$ of the random vector $Y_{0:N-1}$ according to the formula

$$\tilde{f}_Y(j) = \frac{1}{N} \sum_{t=0}^{N-1} \delta_{j,y_t}, \qquad j = 1, 2, \ldots, L, \tag{D.20}$$

where $\delta_{j,k}$ is the Kronecker delta function, defined by

$$\delta_{j,k} = \begin{cases} 1 & \text{if } j = k, \\ 0 & \text{otherwise.} \end{cases} \tag{D.21}$$


The  empirical  pmf  may  be  viewed  alternatively  as  a  histogram  having  L  distinct  bins  (one 
for  each  of  the  L  symbols  that  could  occur  within  the  realization)  whose  bin  totals  have 
been  normalized  so  that  they  sum  to  one;  in  other  words,  the  empirical  pmf  merely  provides 
a  record  of  the  relative  frequency  of  occurrence  of  each  symbol. 
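
As a small illustration of this histogram view, the following sketch computes the relative frequency of each symbol in a realization. The function name and the sample realization are hypothetical and chosen only for the example.

```python
from collections import Counter


def empirical_pmf(realization, num_symbols):
    """Relative frequency of each symbol 1..num_symbols in the realization."""
    n = len(realization)
    counts = Counter(realization)
    # Bin total for symbol j, normalized so that the bins sum to one.
    return [counts.get(j, 0) / n for j in range(1, num_symbols + 1)]


# Hypothetical realization of length N = 8 on the symbol set {1, 2, 3, 4}.
y = [1, 3, 3, 2, 3, 1, 3, 2]
pmf = empirical_pmf(y, 4)
```

Symbols that never occur in the realization (here, symbol 4) simply receive probability zero.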

We first seek to establish that the maximum likelihood parameter estimation problem
given in (D.19) explicitly involves the empirical pmf. Observe that by taking the logarithm
of the right-hand side of (D.19) and then pre-multiplying by a factor of $1/N$, we arrive at
an equivalent expression of the maximum-likelihood problem given by

$$\Psi_{ML} = \arg\max_{\Psi \in \mathcal{V}} \frac{1}{N} \sum_{t=0}^{N-1} \log \hat{f}_Y(y_t; \Psi). \tag{D.22}$$

Now, since any given observation $y_t$ must assume exactly one of $L$ values, each of the
$N$ terms in the above summation can be placed into one of $L$ categories according to its
value. After all of the terms have been categorized in this way, the $j$th category would
then contain a total of $\sum_{t=0}^{N-1} \delta_{j,y_t}$ terms. This suggests that we can rewrite the above
log-likelihood function in terms of the empirical pmf $\tilde{f}_Y(\cdot)$ as

\begin{align}
\frac{1}{N} \sum_{t=0}^{N-1} \log \hat{f}_Y(y_t; \Psi)
 &= \sum_{j=1}^{L} \left( \frac{1}{N} \sum_{t=0}^{N-1} \delta_{j,y_t} \right) \log \hat{f}_Y(j; \Psi) \tag{D.23} \\
 &= \sum_{j=1}^{L} \tilde{f}_Y(j) \log \hat{f}_Y(j; \Psi). \tag{D.24}
\end{align}

Therefore, we can express (D.19) equivalently as

$$\Psi_{ML} = \arg\max_{\Psi \in \mathcal{V}} \sum_{j=1}^{L} \tilde{f}_Y(j) \log \hat{f}_Y(j; \Psi). \tag{D.25}$$


As an intermediate step, let us now examine the Kullback-Leibler distance between the
empirical pmf $\tilde{f}_Y(\cdot)$ and the hypothesized pmf $\hat{f}_Y(\cdot\,; \Psi)$, which is given by

\begin{align}
D(\tilde{f}_Y, \hat{f}_Y) &= \sum_{j=1}^{L} \tilde{f}_Y(j) \log \frac{\tilde{f}_Y(j)}{\hat{f}_Y(j; \Psi)} \tag{D.26} \\
 &= \sum_{j=1}^{L} \tilde{f}_Y(j) \log \tilde{f}_Y(j) - \sum_{j=1}^{L} \tilde{f}_Y(j) \log \hat{f}_Y(j; \Psi). \tag{D.27}
\end{align}

In this last equality, we note that the first term is a constant and is therefore entirely
independent of the parameter $\Psi$. Thus, in order to minimize the Kullback-Leibler distance
given above, we may simply ignore the first term in (D.27) and then attempt to maximize
the second term (after dropping the leading minus sign). But this once again yields the rule

$$\Psi = \arg\max_{\Psi \in \mathcal{V}} \sum_{j=1}^{L} \tilde{f}_Y(j) \log \hat{f}_Y(j; \Psi), \tag{D.28}$$

which is identical to (D.25). We have therefore shown that maximizing the log-likelihood
function in (D.25) leads to precisely the same result as minimizing the Kullback-Leibler
distance between the empirical pmf $\tilde{f}_Y(\cdot)$ and the hypothesized pmf $\hat{f}_Y(\cdot\,; \Psi)$.
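
The equivalence just established can be illustrated with a short numerical sketch. Everything specific in it is an assumption made for the example: the one-parameter family `family_pmf`, the sample data, and the use of a coarse grid search in place of an exact optimization. The sketch maximizes the average log-likelihood computed directly from the samples, minimizes the Kullback-Leibler distance from the empirical pmf to the hypothesized pmf, and compares the two selected parameter values.

```python
import math
from collections import Counter


def family_pmf(psi, num_symbols):
    """Hypothetical one-parameter family: f(j; psi) proportional to psi**(j-1)."""
    weights = [psi ** (j - 1) for j in range(1, num_symbols + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def avg_log_likelihood(samples, pmf):
    """(1/N) * sum_t log f(y_t; psi), computed sample by sample."""
    return sum(math.log(pmf[y - 1]) for y in samples) / len(samples)


def kl_distance(p, q):
    """Kullback-Leibler distance sum_j p(j) log(p(j)/q(j))."""
    return sum(pj * math.log(pj / qj) for pj, qj in zip(p, q) if pj > 0)


samples = [1, 1, 2, 1, 3, 2, 1, 1, 2, 3, 1, 2]  # hypothetical realization
L = 3
counts = Counter(samples)
empirical = [counts.get(j, 0) / len(samples) for j in range(1, L + 1)]

grid = [0.05 * k for k in range(1, 40)]  # candidate parameter values
psi_ml = max(grid, key=lambda s: avg_log_likelihood(samples, family_pmf(s, L)))
psi_kl = min(grid, key=lambda s: kl_distance(empirical, family_pmf(s, L)))
```

On this grid both criteria select the same parameter value, as expected, since the two objectives differ only by a sign and a term that does not depend on the parameter.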

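But, in the limit as $N \to \infty$, we have that the empirical pmf $\tilde{f}_Y(\cdot)$ converges in probability
to the true distribution $f_Y(\cdot)$, as shown by

\begin{align}
\lim_{N \to \infty} E\left\{ \left( \tilde{f}_Y(j) - f_Y(j) \right)^2 \right\}
 &= \lim_{N \to \infty} E\left\{ \tilde{f}_Y^2(j) - 2 \tilde{f}_Y(j) f_Y(j) + f_Y^2(j) \right\} \\
 &= \lim_{N \to \infty} \left[ E\left\{ \frac{1}{N^2} \sum_{s=0}^{N-1} \sum_{t=0}^{N-1} \delta_{j,Y_s} \delta_{j,Y_t} \right\}
    - 2 E\left\{ \frac{1}{N} \sum_{t=0}^{N-1} \delta_{j,Y_t} \right\} f_Y(j) + f_Y^2(j) \right] \\
 &= \lim_{N \to \infty} \left[ \frac{1}{N^2} \sum_{t=0}^{N-1} \sum_{s=0}^{N-1} E\left\{ \delta_{j,Y_s} \delta_{j,Y_t} \right\}
    - 2 f_Y^2(j) + f_Y^2(j) \right] \\
 &= \lim_{N \to \infty} \left[ \frac{N^2 - N}{N^2} f_Y^2(j) + \frac{1}{N} f_Y(j) - f_Y^2(j) \right] \\
 &= 0. \tag{D.29}
\end{align}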

We  conclude,  therefore,  that  choosing  a  parameter  value  that  minimizes  the  Kullback- 
Leibler  distance  between  the  hypothesized  pmf  and  the  true  pmf  is  equivalent  to  choosing  a 
parameter  value  that  maximizes  the  hypothesized  likelihood  function  formed  from  an  infinite 
number  of  independent  realizations  from  the  true  pmf.  This  close  connection  between 
estimates  formed  by  minimizing  Kullback-Leibler  distance  and  by  maximizing  the  likelihood 
function  has  been  recognized  previously  by  several  authors,  including  Kullback  [106],  Kriz 
and  Talacko  [103],  Hartigan  [73],  and  Akaike  [7]. 


Appendix  E 

A  Gradient-Descent  Technique  for 
Evaluating  HMM  Parameters 


Throughout much of Chapter 3, we focused exclusively on finding an abstract, theoretical
solution to our problem of finite-state signal approximation. In this appendix, we develop
a simple gradient-based algorithm to implement our solution. The goal of this algorithm is
to produce explicit numerical values for the parameters of the best HMM-based representation
of an arbitrary stationary first-order AR process $\{Y_t\}$. We shall assume throughout
the appendix that we are given the true bivariate signal pdf $f_{Y_0,Y_1}(\cdot)$, which, as we have
demonstrated in Section 3.3, summarizes all information relevant to the search for the best
HMM. Such an assumption entails no loss in generality, since this bivariate pdf can, at least
in principle, be derived from a complete specification of the original AR process. Because
the vector of breakpoints $\mathbf{d} = (d_0, d_1, \ldots, d_L)$ will ultimately determine the parameter values
of the approximating HMM, we concentrate on finding the best value of $\mathbf{d}$ based on the
given function $f_{Y_0,Y_1}(\cdot)$.

E.1  Formulation of the Optimization Problem

We begin by reviewing the basic components involved in the finite-state approximation
problem, and by introducing a slight variation of our previously established notation for
these components. Recall that the joint probability mass function characterizing the pair
of successive state variables $(\Theta_0, \Theta_1)$ in the approximating Markov chain is defined by the
formula

$$R(i, j; \mathbf{d}) = \int_{d_{j-1}}^{d_j} \int_{d_{i-1}}^{d_i} f_{Y_0,Y_1}(y_0, y_1)\, dy_0\, dy_1, \tag{E.1}$$

where we have now shown an explicit dependence on the vector of breakpoints $\mathbf{d}$. In
addition, the marginal pmf for the single state variable $\Theta_0$ is given by

$$P(i; \mathbf{d}) = \int_{d_{i-1}}^{d_i} f_{Y_0}(y_0)\, dy_0, \tag{E.2}$$

or equivalently, via the joint pmf $R(\cdot\,; \mathbf{d})$, by the expression

$$P(i; \mathbf{d}) = \sum_{j=1}^{L} R(i, j; \mathbf{d}) = \sum_{j=1}^{L} R(j, i; \mathbf{d}). \tag{E.3}$$

Our objective is to find the particular value of $\mathbf{d}$ that maximizes the mutual information
between the state variables $\Theta_0$ and $\Theta_1$. The mutual information corresponding to a given
value of $\mathbf{d}$ is defined by

$$I(\mathbf{d}) = \sum_{i=1}^{L} \sum_{j=1}^{L} R(i, j; \mathbf{d}) \log \frac{R(i, j; \mathbf{d})}{P(i; \mathbf{d})\, P(j; \mathbf{d})}. \tag{E.4}$$

Thus, we wish to solve the maximization problem

$$\mathbf{d}^* = \arg\max_{\mathbf{d} \in \mathcal{V}} I(\mathbf{d}), \tag{E.5}$$

where $\mathcal{V}$ can be viewed as a subset of $\mathbb{R}^{L+1}$ that enforces the strict ordering constraint

$$-\infty = d_0 < d_1 < d_2 < \cdots < d_L = \infty. \tag{E.6}$$

Once the optimal breakpoint vector $\mathbf{d}^*$ has been obtained, we can easily compute the critical
joint state probabilities $\{R(i, j; \mathbf{d}^*)\}_{i,j=1}^{L}$ by numerically integrating the pdf $f_{Y_0,Y_1}(\cdot)$ over
appropriate rectangular regions in $\mathbb{R}^2$, as suggested by (E.1). Similarly, the remaining HMM
parameter values can also be easily computed.
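
For concreteness, the objective in (E.4) can be evaluated as follows once the joint state probabilities are available. This is a minimal sketch that assumes the joint pmf has already been computed (for instance by numerical integration) and is supplied as a nested list `R` with symmetric marginals; the two test matrices are hypothetical and are used only to confirm the limiting cases of independent and perfectly dependent states.

```python
import math


def mutual_information(R):
    """Evaluate sum_ij R[i][j] * log(R[i][j] / (P[i] * P[j])), as in (E.4).

    R is an L-by-L joint pmf whose row and column marginals coincide,
    so a single marginal P (row sums, as in (E.3)) is used for both.
    """
    L = len(R)
    P = [sum(R[i][j] for j in range(L)) for i in range(L)]
    total = 0.0
    for i in range(L):
        for j in range(L):
            if R[i][j] > 0.0:  # terms with zero probability contribute nothing
                total += R[i][j] * math.log(R[i][j] / (P[i] * P[j]))
    return total


# Independent successive states: R(i, j) = P(i) P(j), so I = 0.
mi_independent = mutual_information([[0.09, 0.21], [0.21, 0.49]])
# Perfectly dependent successive states with two equally likely values: I = log 2.
mi_dependent = mutual_information([[0.5, 0.0], [0.0, 0.5]])
```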


E.2  Finding  an  Optimal  Solution  with  a  Classical 
Hill-Climbing  Algorithm 

Because the maximization problem in (E.5) cannot be solved in closed form, we will pursue
a simple iterative, hill-climbing procedure based on the principle of steepest ascent. Such a
procedure generates a sequence of breakpoint vector values $\{\mathbf{d}^{(0)}, \mathbf{d}^{(1)}, \mathbf{d}^{(2)}, \ldots\}$ according
to the recursive formula

$$\mathbf{d}^{(s+1)} = \mathbf{d}^{(s)} + \lambda^{(s)}\, \nabla I(\mathbf{d}^{(s)}), \tag{E.7}$$

where $s$ is the iteration index, $\nabla I$ is the mutual information gradient vector defined by

$$\nabla I(\mathbf{d}) = \left[ \frac{\partial I(\mathbf{d})}{\partial d_0},\ \frac{\partial I(\mathbf{d})}{\partial d_1},\ \ldots,\ \frac{\partial I(\mathbf{d})}{\partial d_L} \right]^T, \tag{E.8}$$

and $\lambda^{(s)}$ is a real number chosen such that

$$\lambda^{(s)} = \arg\max_{\lambda} \left\{ I\left( \mathbf{d}^{(s)} + \lambda \nabla I(\mathbf{d}^{(s)}) \right) \right\}. \tag{E.9}$$

The core of the steepest-ascent algorithm therefore consists of two steps: (i) calculation
of the gradient vector $\nabla I(\mathbf{d}^{(s)})$; and (ii) solution of the univariate maximization problem
in (E.9) (often referred to as the line search portion of the algorithm) to obtain the proper
scalar multiplier $\lambda^{(s)}$.

We remark that it is sometimes difficult, through purely analytical means, to obtain
an exact, globally optimal solution during the line search in step (ii). This should not be
surprising, since the original $(L+1)$-dimensional maximization of the same objective function
was sufficiently complicated that it required a numerical optimization procedure in the first
place. In many such cases, it is desirable to abandon an intensive search for the optimal
solution in favor of an alternative method that yields a somewhat coarse, approximate
solution but consumes far less computation. In the present case, it is reasonable (from the
standpoint of minimizing computational expense) to choose a pseudo-optimal value of $\lambda^{(s)}$
by searching over a small, finite set of candidate values $\{\lambda_1^{(s)}, \lambda_2^{(s)}, \ldots, \lambda_J^{(s)}\}$, which may be
allowed to change at each iteration; hence, the modified line search will take the form

$$\hat{\lambda}^{(s)} = \arg\max_{\lambda \in \{\lambda_1, \lambda_2, \ldots, \lambda_J\}} \left\{ I\left( \mathbf{d}^{(s)} + \lambda \nabla I(\mathbf{d}^{(s)}) \right) \right\}. \tag{E.10}$$

A number of methods are available for determining which candidate values should be in the
above search set, as well as for determining the appropriate cardinality of this set (see, for
example, [101, 112, 202]).
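
The recursion in (E.7) with the finite-set line search of (E.10) can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the thesis: the objective is a simple concave quadratic standing in for the mutual information, its gradient is supplied analytically, the candidate step sizes are arbitrary fixed values, and the stopping rule (stop when no candidate improves the objective or when the applied step is very small) is one simple choice among many.

```python
import math


def steepest_ascent(objective, gradient, x0, candidates, tol=1e-8, max_iter=1000):
    """Steepest ascent with the step size chosen from a finite candidate set."""
    x = list(x0)
    for _ in range(max_iter):
        g = gradient(x)
        # Modified line search: evaluate the objective at each candidate step.
        best = max(
            candidates,
            key=lambda lam: objective([xi + lam * gi for xi, gi in zip(x, g)]),
        )
        step = [best * gi for gi in g]
        x_new = [xi + si for xi, si in zip(x, step)]
        if objective(x_new) <= objective(x):
            break  # no candidate step improves the objective
        x = x_new
        if math.sqrt(sum(si * si for si in step)) < tol:
            break  # applied perturbation is below the threshold
    return x


def toy_objective(x):
    # Hypothetical stand-in for I(d): concave, maximized at (1, -0.5).
    return -(x[0] - 1.0) ** 2 - 2.0 * (x[1] + 0.5) ** 2


def toy_gradient(x):
    return [-2.0 * (x[0] - 1.0), -4.0 * (x[1] + 0.5)]


solution = steepest_ascent(toy_objective, toy_gradient, [0.0, 0.0], [0.5, 0.25, 0.1, 0.01])
```

A coarse candidate set keeps each iteration cheap, at the price of possibly taking more iterations than an exact line search would need.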

To complete the description of the steepest-ascent algorithm, we need to include suitable
procedures for both initialization and termination of the recursion in (E.7). To initialize the
algorithm, we must assign a reasonable value to the vector $\mathbf{d}^{(0)}$. Of course, since the first
and last elements of this vector are, as always, constrained by the equations $d_0^{(0)} = -\infty$ and
$d_L^{(0)} = +\infty$, we need not be concerned with finding values for these elements. However, for
the remaining elements $d_1^{(0)}, d_2^{(0)}, \ldots, d_{L-1}^{(0)}$ we can use the values that result from imposing
the simultaneous conditions

$$\int_{-\infty}^{d_1^{(0)}} f_{Y_0}(y)\, dy = \int_{d_1^{(0)}}^{d_2^{(0)}} f_{Y_0}(y)\, dy = \cdots = \int_{d_{L-1}^{(0)}}^{\infty} f_{Y_0}(y)\, dy = \frac{1}{L}, \tag{E.11}$$

so that all intervals initially have equal probability. This method of assigning starting
breakpoint values tends to work well in practice. To terminate the algorithm, we can
simply iterate the recursion in (E.7) until we reach a point at which the magnitude of the
applied perturbation $\lambda^{(s)} \nabla I(\mathbf{d}^{(s)})$ is below some prespecified threshold value.
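
The equal-probability initialization in (E.11) amounts to placing the interior breakpoints at quantiles of the marginal distribution. The sketch below assumes, purely for illustration, that the marginal pdf is a standard Gaussian, so that the quantiles are available from the Python standard library; for a non-Gaussian marginal the inverse cdf would have to be obtained numerically.

```python
import math
from statistics import NormalDist


def initial_breakpoints(marginal, num_states):
    """Breakpoints d_0..d_L giving each interval probability 1/L under `marginal`."""
    # Interior breakpoints sit at the quantiles i/L, i = 1, ..., L-1.
    inner = [marginal.inv_cdf(i / num_states) for i in range(1, num_states)]
    return [-math.inf] + inner + [math.inf]


# Assumed marginal: zero-mean, unit-variance Gaussian; L = 4 states.
d0 = initial_breakpoints(NormalDist(0.0, 1.0), 4)
interval_probs = [
    NormalDist(0.0, 1.0).cdf(hi) - NormalDist(0.0, 1.0).cdf(lo)
    for lo, hi in zip(d0[:-1], d0[1:])
]
```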


E.3  Derivation  of  the  Mutual  Information  Gradient 

The only ingredient that is missing from the above description of the steepest-ascent algorithm
is a formula for the gradient of the mutual information with respect to the breakpoint
vector $\mathbf{d}$. Because this quantity is an essential part of the algorithm, and because the
mathematical expression for it is rather involved, we devote this subsection to a derivation of
$\nabla I$.

We begin by applying basic principles of differential calculus to the expression in (E.4).
Specifically, after some calculation we find that the partial derivative of $I(\mathbf{d})$ with respect
to the element $d_k$ is given by

$$\frac{\partial I(\mathbf{d})}{\partial d_k} = \sum_{i=1}^{L} \sum_{j=1}^{L} \left\{ \frac{\partial R(i, j; \mathbf{d})}{\partial d_k} \left( 1 + \log \frac{R(i, j; \mathbf{d})}{P(i; \mathbf{d}) P(j; \mathbf{d})} \right) - \frac{R(i, j; \mathbf{d})}{P(i; \mathbf{d}) P(j; \mathbf{d})} \left( P(j; \mathbf{d}) \frac{\partial P(i; \mathbf{d})}{\partial d_k} + P(i; \mathbf{d}) \frac{\partial P(j; \mathbf{d})}{\partial d_k} \right) \right\}. \tag{E.12}$$


While this expression is also true for the special index values $k = 0$ and $k = L$, in these
cases we have that

$$\frac{\partial I(\mathbf{d})}{\partial d_0} = \frac{\partial I(\mathbf{d})}{\partial d_L} = 0, \tag{E.13}$$

owing to the fact that $d_0$ and $d_L$ are constants; thus, in the following derivation we shall focus
only on the intermediate index values $k = 1, 2, \ldots, L-1$. We note in addition that the
right-hand side of (E.12) involves derivatives of both of the probability mass functions $R$
and $P$. However, since we have already established the identities

$$\frac{\partial P(i; \mathbf{d})}{\partial d_k} = \sum_{j=1}^{L} \frac{\partial R(i, j; \mathbf{d})}{\partial d_k} = \sum_{j=1}^{L} \frac{\partial R(j, i; \mathbf{d})}{\partial d_k}, \tag{E.14}$$

we know that the derivatives of $I$ actually depend on the derivatives of $R$ alone; therefore,
in our remaining calculations, it suffices to focus our attention only on the function $R$.

From (E.1) we see that the functional dependence of $R$ on the elements of $\mathbf{d}$ is expressed
implicitly through the upper and lower limits of a definite double integral. Because of this
unusual implicit dependence, the calculation of a partial derivative such as $\partial R(i, j; \mathbf{d}) / \partial d_k$
will require some rather careful bookkeeping. We begin by making some simplifying observations.
First, note that each double integral in (E.1) is evaluated over a rectangular region
of the form $[d_{i-1}, d_i] \times [d_{j-1}, d_j]$ within the coordinate plane, and furthermore that there
are a total of $L^2$ such regions making up the entire plane. When the value of a single breakpoint
$d_k$ is modified slightly, only certain of these regions are affected, and therefore only
certain partial derivatives with respect to $d_k$ are nonzero. The relevant question is whether
the perturbed breakpoint $d_k$ coincides with at least one of the breakpoints $d_{i-1}$, $d_i$, $d_{j-1}$, $d_j$,
which define a particular rectangular region.

For any given region, exactly one of the following four cases will be true regarding the
relationship between the breakpoint indices $i$, $j$, and $k$:

(a)  $k \notin \{i-1, i\}$ and $k \notin \{j-1, j\}$

(b)  $k \notin \{i-1, i\}$ and $k \in \{j-1, j\}$

(c)  $k \in \{i-1, i\}$ and $k \notin \{j-1, j\}$

(d)  $k \in \{i-1, i\}$ and $k \in \{j-1, j\}$


These four cases are depicted in Figures E-1(a) through E-1(d), respectively. The heavy lines
shown in the horizontal and vertical dimensions in each figure correspond to the breakpoint
$d_k$ that is being perturbed. All of the figures depict the same partitioning of the coordinate
plane into $L^2$ rectangles, but each figure highlights a different subset of these $L^2$ rectangles
through special shading. For example, the shaded rectangles shown in Figure E-1(a) are
those whose associated index sets $\{i-1, i\}$ and $\{j-1, j\}$ satisfy condition (a) from the
above list. Similarly, the index sets associated with the shaded rectangles in Figures E-1(b)
through E-1(d) satisfy conditions (b) through (d), respectively.

Observe that none of the shaded rectangles shown in Figure E-1(a) is affected by a
change in $d_k$; hence, we have that

$$\frac{\partial R(i, j; \mathbf{d})}{\partial d_k} = 0 \tag{E.15}$$

whenever condition (a) is true. In Figure E-1(b), however, each of the shaded rectangles
will be affected in some way by a change in $d_k$. In particular, the rectangles situated below
the heavy horizontal line (i.e., those that have $d_k$ as an upper limit) will increase in area
if $d_k$ increases; on the other hand, the rectangles situated above the heavy line will decrease in
area if $d_k$ increases. More specifically, if we change the value of $d_k$ by a very small amount
to the new value $d_k + \Delta$, then the value of the double integral over a rectangle just below
the line (say, the rectangle associated with the index set $\{i-1, i\}$ on the horizontal axis)
will change by approximately the amount $\Delta \int_{d_{i-1}}^{d_i} f_{Y_0,Y_1}(y_0, d_k)\, dy_0$. Moreover, an equal
and opposite change will occur in the value of the double integral over the corresponding
rectangle just above the line. Thus, in the limit as $\Delta \to 0$ we can write

$$\frac{\partial R(i, j; \mathbf{d})}{\partial d_k} = \left[ \delta_{kj} - \delta_{k,j-1} \right] \int_{d_{i-1}}^{d_i} f_{Y_0,Y_1}(y_0, d_k)\, dy_0, \tag{E.16}$$


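where $\delta_{kj}$ is the Kronecker delta function defined by

$$\delta_{kj} = \begin{cases} 1, & \text{if } k = j, \\ 0, & \text{if } k \neq j. \end{cases} \tag{E.17}$$

An analogous argument holds for condition (c), which is depicted in Figure E-1(c). When
this condition is true, we have that

$$\frac{\partial R(i, j; \mathbf{d})}{\partial d_k} = \left[ \delta_{ki} - \delta_{k,i-1} \right] \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(d_k, y_1)\, dy_1. \tag{E.18}$$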

Finally, when condition (d) is true, each of the shaded rectangles shown in Figure E-1(d)
undergoes two different kinds of changes when $d_k$ changes, one in the vertical dimension
and one in the horizontal dimension. The overall effect on the value of the double integral
over each rectangle is simply the sum of these two individual changes. Thus we can write

$$\frac{\partial R(i, j; \mathbf{d})}{\partial d_k} = \left[ \delta_{kj} - \delta_{k,j-1} \right] \int_{d_{i-1}}^{d_i} f_{Y_0,Y_1}(y_0, d_k)\, dy_0 + \left[ \delta_{ki} - \delta_{k,i-1} \right] \int_{d_{j-1}}^{d_j} f_{Y_0,Y_1}(d_k, y_1)\, dy_1. \tag{E.19}$$

Figure E-1: Depiction of four separate cases encountered in calculation of the gradient
vector. Shaded rectangular regions undergo the following types of changes when the value
of $d_k$ is perturbed: (a) no change at all; (b) change in the vertical dimension only; (c) change
in the horizontal dimension only; (d) change in both the vertical and horizontal dimensions.

In fact, though it may not be evident at first, this last expression concisely represents each
of the conditions (a) through (d). A careful inspection reveals that (E.19) is valid regardless
of whether $d_k$ is an upper or lower limit in the inner integral, an upper or lower limit in the
outer integral, or not a limit at all. Although applying the formula in (E.19) may require
the use of a standard numerical integration procedure, the formula itself now allows us
to easily evaluate all other derivatives needed during the operation of our steepest-ascent
algorithm, namely those expressed in (E.14) and (E.12).
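
One way to gain confidence in (E.19) is to compare it against finite differences in a case where the integrals are available in closed form. The sketch below makes a strong simplifying assumption that is not part of the thesis: the bivariate pdf is taken to be separable, $f_{Y_0,Y_1}(y_0, y_1) = g(y_0) g(y_1)$ with $g$ a standard Gaussian, so that each rectangle probability is a product of cdf differences and each line integral in (E.19) reduces to $g(d_k)$ times a cdf difference. The breakpoint values are hypothetical.

```python
import math
from statistics import NormalDist

STD = NormalDist(0.0, 1.0)  # assumed marginal g; the joint pdf is g(y0) * g(y1)


def cell_prob(d, i, j):
    """R(i, j; d) of (E.1) for the assumed separable pdf."""
    return (STD.cdf(d[i]) - STD.cdf(d[i - 1])) * (STD.cdf(d[j]) - STD.cdf(d[j - 1]))


def delta(a, b):
    """Kronecker delta."""
    return 1.0 if a == b else 0.0


def dR_formula(d, i, j, k):
    """Right-hand side of (E.19), specialized to the separable pdf."""
    g_k = STD.pdf(d[k])
    vertical = (delta(k, j) - delta(k, j - 1)) * g_k * (STD.cdf(d[i]) - STD.cdf(d[i - 1]))
    horizontal = (delta(k, i) - delta(k, i - 1)) * g_k * (STD.cdf(d[j]) - STD.cdf(d[j - 1]))
    return vertical + horizontal


def dR_numeric(d, i, j, k, h=1e-6):
    """Central finite-difference estimate of dR(i, j; d)/dd_k."""
    up = list(d)
    up[k] += h
    down = list(d)
    down[k] -= h
    return (cell_prob(up, i, j) - cell_prob(down, i, j)) / (2.0 * h)


# Hypothetical breakpoints for L = 4; only interior breakpoints k = 1..3 are perturbed.
d = [-math.inf, -0.8, 0.1, 0.9, math.inf]
max_err = max(
    abs(dR_formula(d, i, j, k) - dR_numeric(d, i, j, k))
    for i in range(1, 5)
    for j in range(1, 5)
    for k in range(1, 4)
)
```

The loop covers all four cases (a) through (d), so a small `max_err` indicates that the single expression handles each of them, at least for this separable example.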


Appendix  F 


An  Algorithm  for  Computing 
Posterior  HMM  State  Probabilities 


During our analysis of the signal estimation problem in Chapter 5, we demonstrated that,
if the signal and noise are independent, additively combined processes, and if each is the
output of a finite-state HMM whose state densities are Gaussian mixtures, then the observation
is also the output of an HMM of this same type (albeit one with many more
parameters). The special HMM-based structure of the observed signal gives rise to an extremely
efficient recursive algorithm for computing the posterior probabilities associated
with the states of the underlying Markov chain [19, 47, 78, 154, 155]. In this appendix,
we shall construct the overall algorithm in three distinct steps. First, we develop a recursion
that runs forward in time, accounting for all past observations at any given time
by computing the quantity $f_{Z_{0:t},\Theta_t}(z_{0:t}, \Theta_t = i)$. Next, we develop a complementary recursion
that runs backward in time, accounting for all future observations at any time by
computing the quantity $f_{Z_{t+1:N-1}|\Theta_t}(z_{t+1:N-1} \mid \Theta_t = i)$. Finally, we combine the results of
the forward and backward recursions to get the value of the desired posterior state probabilities
$\Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\}$. After considering each of these steps in turn, we
then describe a special numerical conditioning procedure that must be incorporated into
the recursive algorithm when it is implemented on a digital computer.


F.1  Developing the Forward Recursion

We begin by deriving the recursion that runs forward in time. Let us define the forward
variable $\alpha_t(i)$ as

$$\alpha_t(i) = f_{Z_{0:t},\Theta_t}(z_{0:t}, \Theta_t = i), \tag{F.1}$$

where it is understood that the time index $t$ lies in the set $\{0, 1, \ldots, N-1\}$ and the state
index $i$ lies in the set $\{1, 2, \ldots, L\}$. To develop a recursive procedure for computing all of
the values $\alpha_t(i)$, we require a method for starting the recursion (the initialization step) and
a method for carrying the recursion forward in time (the induction step). The initialization
step follows directly from quantities we already know; in particular, we have

$$\alpha_0(i) = f_{Z_0,\Theta_0}(z_0, \Theta_0 = i) = P(i)\, h_i(z_0), \qquad i = 1, 2, \ldots, L. \tag{F.2}$$


The  induction  step  requires  a  little  more  work.  Let  us  assume  that  the  values  {as(j)}  of 
the  forward  variable  have  already  been  computed  for  the  time  indices  s  —  0, 1,  •  •  •  ,  t  and 
for  the  state  values  j  =  1,2, •••  , L.  We  wish  to  relate  these  known  values  to  the  new,  as 
yet  unknown  value  at+i(i).  Using  the  definition  of  at+i(i),  we  may  write 


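\begin{align}
\alpha_{t+1}(i) &= f_{Z_{0:t+1},\Theta_{t+1}}(z_{0:t+1}, \Theta_{t+1} = i) \tag{F.3} \\
&= \sum_{j=1}^{L} f_{Z_{0:t+1},\Theta_t,\Theta_{t+1}}(z_{0:t+1}, \Theta_t = j, \Theta_{t+1} = i) \tag{F.4} \\
&= \sum_{j=1}^{L} f_{Z_{0:t},Z_{t+1},\Theta_t,\Theta_{t+1}}(z_{0:t}, z_{t+1}, \Theta_t = j, \Theta_{t+1} = i) \tag{F.5} \\
&= \sum_{j=1}^{L} \Pr\{\Theta_{t+1} = i \mid \Theta_t = j\} \cdot
   f_{Z_{t+1}|\Theta_{t+1}}(z_{t+1} \mid \Theta_{t+1} = i)\, f_{Z_{0:t},\Theta_t}(z_{0:t}, \Theta_t = j) \tag{F.6} \\
&= \left[ \sum_{j=1}^{L} \alpha_t(j)\, Q(j,i) \right] h_i(z_{t+1}). \tag{F.7}
\end{align}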


This is the desired recursive relationship. In Figure F-1, we give a graphical depiction of the
computations that are required to generate the future forward variable value
$\alpha_{t+1}(i)$ from the $L$ currently available values $\{\alpha_t(j)\}_{j=1}^{L}$. The
figure depicts a trellis whose dimensions are state and time. A feasible trajectory of the
state variable over the entire length of the observation can be envisioned on this trellis as
a polygonal path that intersects exactly one state node at each time index between $0$ and
$N-1$.

The figure shows the path segments that could lead to state $i$ at time $t+1$ from the
$L$ possible states at the immediately preceding time $t$. Recall that the quantity
$\alpha_t(j)$ is the joint probability that (i) the vector $z_{0:t}$ was observed, and (ii) the
state at time $t$ was $j$. This means we can interpret the product $\alpha_t(j) Q(j,i)$ as the
joint probability that (i) $z_{0:t}$ was observed, and (ii) state $i$ was reached at time $t+1$
by way of state $j$. If we then add together all of the products of this form (i.e., the
products for all possible values of $j$, holding fixed the time value $t$ and state value $i$)
we obtain the joint probability that (i) the vector $z_{0:t}$ was observed, and (ii) the state
at time $t+1$ was $i$. Once this probability has been computed, we see that the new quantity
$\alpha_{t+1}(i)$ can be evaluated by multiplying the summed quantity by the output pdf value
$h_i(z_{t+1})$; this accounts for the fact that $z_{t+1}$
was  observed  while  in  state  i.  An  analogous  computation  is  carried  out  for  each  possible 
state  value  i  at  time  t  +  1.  To  keep  the  recursion  moving  forward,  we  then  repeat  this  same 
overall sequence of computations at the subsequent time index.

[Figure F-1: trellis with rows STATE 1, STATE 2, STATE 3, ..., STATE L and columns TIME 0,
TIME 1, ..., TIME t, TIME t+1, ..., TIME N-1; an arrow is labeled FORWARD RECURSION.]

Figure F-1: Illustration of the forward recursive procedure used to compute the new quantity
$\alpha_{t+1}(i)$ from the $L$ values $\{\alpha_t(j)\}_{j=1}^{L}$ just computed. Recursion is
depicted on an $L \times N$ trellis whose nodes represent possible states of the underlying
Markov chain at each time index. Shaded sets of nodes are those for which the forward variable
has already been evaluated. Vertical dashed line indicates the current stage of computation.


We summarize the forward recursion with the following two formulas:

INITIALIZATION

$$\alpha_0(i) = P(i)\, h_i(z_0), \quad i = 1, 2, \ldots, L \tag{F.8}$$

INDUCTION

$$\alpha_{t+1}(i) = \left[ \sum_{j=1}^{L} \alpha_t(j)\, Q(j,i) \right] h_i(z_{t+1}),
\quad i = 1, 2, \ldots, L; \quad t = 0, 1, \ldots, N-2. \tag{F.9}$$
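
The two formulas above translate almost line for line into code. The following is a minimal sketch rather than a reference implementation: the function names, the zero-based state indexing, and the two-state Gaussian-output model are all assumptions introduced purely for illustration.

```python
import math


def gaussian_pdf(z, mean, std):
    """Value at z of a Gaussian pdf with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def forward(P, Q, h, z):
    """Unscaled forward recursion, following (F.8) and (F.9).

    P[i]    : initial probability of state i
    Q[j][i] : probability of moving from state j to state i
    h[i]    : output pdf of state i, called as h[i](z_t)
    z       : observed sequence z_0, ..., z_{N-1}
    Returns alpha with alpha[t][i] = alpha_t(i); states are indexed from 0 here.
    """
    L = len(P)
    alpha = [[P[i] * h[i](z[0]) for i in range(L)]]               # (F.8)
    for t in range(len(z) - 1):
        prev = alpha[-1]
        alpha.append([
            sum(prev[j] * Q[j][i] for j in range(L)) * h[i](z[t + 1])   # (F.9)
            for i in range(L)
        ])
    return alpha


# Hypothetical two-state model; every number below is illustrative only.
P = [0.6, 0.4]
Q = [[0.9, 0.1], [0.2, 0.8]]
h = [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 3.0, 1.0)]
z = [0.1, 2.9, 3.2]

alpha = forward(P, Q, h, z)
# Summing alpha_{N-1}(i) over the states gives the pdf value of the whole observation.
likelihood = sum(alpha[-1])
```

Each time step costs on the order of $L^2$ multiplications, so the whole pass is linear in the observation length $N$.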


F.2  Developing  the  Backward  Recursion 

We now derive the recursion that runs backward in time. For this procedure, we define the
backward variable $\beta_t(i)$ as

$$\beta_t(i) = f_{Z_{t+1:N-1}|\Theta_t}(z_{t+1:N-1} \mid \Theta_t = i), \tag{F.10}$$

which holds when the state index $i$ is contained in $\{1, 2, \ldots, L\}$ and the time index
$t$ is contained in $\{0, 1, \ldots, N-2\}$. For the final time index $t = N-1$, we no longer
use the definition in (F.10), but instead arbitrarily specify that

$$\beta_{N-1}(i) = 1, \quad i = 1, 2, \ldots, L. \tag{F.11}$$

Once this initialization is done, we can develop the recursion for $\beta_t(i)$ just as we did
for $\alpha_t(i)$, now progressing backward in time rather than forward. Specifically, we can
write


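\begin{align}
\beta_t(i) &= f_{Z_{t+1:N-1}|\Theta_t}(z_{t+1:N-1} \mid \Theta_t = i) \tag{F.12} \\
&= \sum_{j=1}^{L} f_{Z_{t+1:N-1},\Theta_{t+1}|\Theta_t}(z_{t+1:N-1}, \Theta_{t+1} = j \mid \Theta_t = i) \tag{F.13} \\
&= \sum_{j=1}^{L} f_{Z_{t+1},Z_{t+2:N-1},\Theta_{t+1}|\Theta_t}(z_{t+1}, z_{t+2:N-1}, \Theta_{t+1} = j \mid \Theta_t = i) \tag{F.14} \\
&= \sum_{j=1}^{L} \Pr\{\Theta_{t+1} = j \mid \Theta_t = i\}\,
   f_{Z_{t+1}|\Theta_{t+1}}(z_{t+1} \mid \Theta_{t+1} = j) \cdot
   f_{Z_{t+2:N-1}|\Theta_{t+1}}(z_{t+2:N-1} \mid \Theta_{t+1} = j) \tag{F.15} \\
&= \sum_{j=1}^{L} \beta_{t+1}(j)\, Q(i,j)\, h_j(z_{t+1}). \tag{F.16}
\end{align}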

This last formula is the desired recursive relationship. In Figure F-2, we give a graphical
depiction (analogous to the one given in Figure F-1) of the computations that are required
to generate the backward variable value $\beta_t(i)$ from the $L$ values
$\{\beta_{t+1}(j)\}_{j=1}^{L}$ just computed. The figure shows the path segments leading from
state $i$ at time $t$ to the $L$ possible states that could be reached at time $t+1$. Recall
that the quantity $\beta_{t+1}(j)$ is the probability that the vector $z_{t+2:N-1}$ was
observed given that the state at time $t+1$ was $j$. This means that we can interpret the
product $h_j(z_{t+1}) \beta_{t+1}(j)$ as the probability that $z_{t+1:N-1}$ was observed given
that the state at time $t+1$ was $j$. Furthermore, we can interpret the product
$Q(i,j) h_j(z_{t+1}) \beta_{t+1}(j)$ as the joint probability, conditioned on the event that
the state at time $t$ was $i$, that (i) $z_{t+1:N-1}$ was observed; and (ii) state $j$ was
reached at time $t+1$ by way of state $i$. If we then add together the products of this form
for all $j$, we obtain the probability that the vector $z_{t+1:N-1}$ was observed given that
the state at time $t$ was $i$, which is simply $\beta_t(i)$. An analogous computation is
carried out for each possible state value $i$ at time $t$. To keep the recursion moving
backward, we then repeat this same overall sequence of computations at the immediately
preceding time index.


We summarize the backward recursion with the following two formulas:

INITIALIZATION

$$\beta_{N-1}(i) = 1, \quad i = 1, 2, \ldots, L \tag{F.17}$$

INDUCTION

$$\beta_t(i) = \sum_{j=1}^{L} \beta_{t+1}(j)\, Q(i,j)\, h_j(z_{t+1}),
\quad i = 1, 2, \ldots, L; \quad t = N-2, N-3, \ldots, 0. \tag{F.18}$$

[Figure F-2: trellis with rows STATE 1, STATE 2, STATE 3, ..., STATE L and columns running
from TIME 0 to TIME N-1.]

Figure F-2: Illustration of the backward recursive procedure used to compute the new
quantity $\beta_t(i)$ from the $L$ values $\{\beta_{t+1}(j)\}_{j=1}^{L}$ just computed.
Recursion is depicted on an $L \times N$ trellis whose nodes represent possible states of the
underlying Markov chain at each time index. Shaded sets of nodes are those for which the
backward variable has already been evaluated. Vertical dashed line indicates the current stage
of computation.
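
The backward pass can be sketched in the same style as the forward pass. As before, the function name, the zero-based indexing, and the two-state Gaussian-output model are assumptions made only for the sake of a runnable example.

```python
import math


def gaussian_pdf(z, mean, std):
    """Value at z of a Gaussian pdf with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def backward(Q, h, z):
    """Unscaled backward recursion, following (F.17) and (F.18).

    Q[i][j] : probability of moving from state i to state j
    h[j]    : output pdf of state j, called as h[j](z_t)
    z       : observed sequence z_0, ..., z_{N-1}
    Returns beta with beta[t][i] = beta_t(i); states are indexed from 0 here.
    """
    L, N = len(Q), len(z)
    beta = [[1.0] * L for _ in range(N)]          # last row stays at 1, as in (F.17)
    for t in range(N - 2, -1, -1):                # t = N-2, N-3, ..., 0
        beta[t] = [
            sum(beta[t + 1][j] * Q[i][j] * h[j](z[t + 1]) for j in range(L))   # (F.18)
            for i in range(L)
        ]
    return beta


# Hypothetical two-state model; every number below is illustrative only.
P = [0.6, 0.4]
Q = [[0.9, 0.1], [0.2, 0.8]]
h = [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 3.0, 1.0)]
z = [0.1, 2.9, 3.2]

beta = backward(Q, h, z)
# Weighting beta_0(i) by P(i) h_i(z_0) and summing over i recovers the pdf value
# of the whole observation, which gives a simple consistency check.
likelihood_from_beta = sum(P[i] * h[i](z[0]) * beta[0][i] for i in range(2))
```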


F.3  Combining  the  Forward  and  Backward  Results 

Once the forward and backward recursions have been applied to the observed sequence, we
can easily combine the results generated by these recursions to obtain the desired HMM state
probabilities. Let us introduce the more concise notation $\gamma_t(i)$ to represent the
probability that the Markov chain is in state $i$ at time $t$ based on the value of the
observation $z_{0:N-1}$. Observe that, from the definition of conditional probability, we can
immediately write

\begin{align}
\gamma_t(i) &= \Pr\{\Theta_t = i \mid Z_{0:N-1} = z_{0:N-1}\} \tag{F.19} \\
&= \frac{f_{Z_{0:N-1},\Theta_t}(z_{0:N-1}, \Theta_t = i)}{f_{Z_{0:N-1}}(z_{0:N-1})} \tag{F.20} \\
&= \frac{f_{Z_{0:N-1},\Theta_t}(z_{0:N-1}, \Theta_t = i)}
        {\sum_{j=1}^{L} f_{Z_{0:N-1},\Theta_t}(z_{0:N-1}, \Theta_t = j)}. \tag{F.21}
\end{align}

Furthermore, since we have the natural decomposition

\begin{align}
f_{Z_{0:N-1},\Theta_t}(z_{0:N-1}, \Theta_t = i)
&= f_{Z_{0:t},\Theta_t}(z_{0:t}, \Theta_t = i) \cdot
   f_{Z_{t+1:N-1}|\Theta_t}(z_{t+1:N-1} \mid \Theta_t = i) \tag{F.22} \\
&= \alpha_t(i)\, \beta_t(i), \tag{F.23}
\end{align}

we obtain the simplified expression

$$\gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{L} \alpha_t(j)\, \beta_t(j)}, \tag{F.24}$$

which reveals explicitly the functional dependence of the posterior state probabilities on the
output from the recursive procedures derived earlier.
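
To make the combination step in (F.24) concrete, the sketch below runs both unscaled recursions on a short sequence and normalizes their product at each time index. The function name and the two-state Gaussian-output model are hypothetical, and the code is only suitable for short observations because no numerical conditioning is applied.

```python
import math


def gaussian_pdf(z, mean, std):
    """Value at z of a Gaussian pdf with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def posterior_state_probs(P, Q, h, z):
    """Return gamma with gamma[t][i] = Pr{state i at time t | z_0, ..., z_{N-1}}."""
    L, N = len(P), len(z)

    # Forward pass, (F.8)-(F.9).
    alpha = [[P[i] * h[i](z[0]) for i in range(L)]]
    for t in range(1, N):
        alpha.append([
            sum(alpha[t - 1][j] * Q[j][i] for j in range(L)) * h[i](z[t])
            for i in range(L)
        ])

    # Backward pass, (F.17)-(F.18).
    beta = [[1.0] * L for _ in range(N)]
    for t in range(N - 2, -1, -1):
        beta[t] = [
            sum(beta[t + 1][j] * Q[i][j] * h[j](z[t + 1]) for j in range(L))
            for i in range(L)
        ]

    # Combination, (F.24): normalize alpha_t(i) * beta_t(i) over the states.
    gamma = []
    for t in range(N):
        products = [alpha[t][i] * beta[t][i] for i in range(L)]
        total = sum(products)
        gamma.append([p / total for p in products])
    return gamma


# Hypothetical two-state model; every number below is illustrative only.
P = [0.6, 0.4]
Q = [[0.9, 0.1], [0.2, 0.8]]
h = [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 3.0, 1.0)]
z = [0.1, 2.9, 3.2]

gamma = posterior_state_probs(P, Q, h, z)
```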


F.4  Conditioning  the  HMM-Based  Recursions 


Unfortunately, the recursive formulas for the forward and backward variables that we
derived in Sections F.1 and F.2 will produce distorted results when implemented directly on
a digital computer, even for moderate values of the observation length $N$ (e.g., values of
approximately 100 or more). To understand why this is true, let us first re-examine the
mathematical structure of the forward variable $\alpha_t(\cdot)$. By unraveling the recursive
definition of this variable at each successive time index, starting at time 0, we can
determine its composition explicitly at a general time $t$. Assuming that our HMM consists of
$L$ states, and that a particular realization of the HMM output is represented by
$z_{0:N-1}$, we can write the following non-recursive formulas for the forward variable:

\begin{align}
\alpha_0(i_0) &= P(i_0)\, h_{i_0}(z_0) \tag{F.25} \\
\alpha_1(i_1) &= \sum_{i_0=1}^{L} \alpha_0(i_0)\, Q(i_0,i_1)\, h_{i_1}(z_1) \nonumber \\
&= \sum_{i_0=1}^{L} P(i_0)\, Q(i_0,i_1)\, h_{i_0}(z_0)\, h_{i_1}(z_1) \tag{F.26} \\
\alpha_2(i_2) &= \sum_{i_1=1}^{L} \alpha_1(i_1)\, Q(i_1,i_2)\, h_{i_2}(z_2) \nonumber \\
&= \sum_{i_1=1}^{L} \sum_{i_0=1}^{L} P(i_0)\, Q(i_0,i_1)\, Q(i_1,i_2)\,
   h_{i_0}(z_0)\, h_{i_1}(z_1)\, h_{i_2}(z_2) \tag{F.27} \\
&\;\;\vdots \nonumber \\
\alpha_{t+1}(i_{t+1}) &= \sum_{i_t=1}^{L} \alpha_t(i_t)\, Q(i_t,i_{t+1})\, h_{i_{t+1}}(z_{t+1}) \nonumber \\
&= \sum_{i_t=1}^{L} \cdots \sum_{i_0=1}^{L} P(i_0)
   \prod_{k=0}^{t} Q(i_k,i_{k+1}) \prod_{k=0}^{t+1} h_{i_k}(z_k). \tag{F.28}
\end{align}

From  these  expressions,  we  can  see  that  the  forward  variable  is  made  up  of  products  of 
probabilities  (which  are,  by  definition,  confined  to  the  interval  [0, 1])  as  well  as  products  of 
pdf  values  (which  may  exceed  unity,  but  are  usually  much  closer  to  zero).  Because  these 
terms  are  typically  quite  small,  and  because  the  number  of  terms  in  each  product  grows 
linearly  with  t,  the  products  themselves  eventually  tend  toward  zero  at  an  exponential  rate. 
Furthermore,  the  summing  of  many  such  vanishing  products  (which  is  done  to  produce  the 
final  value  of  the  forward  variable)  has  virtually  no  mitigating  effect  on  this  exponential 
decay in magnitude. Thus, when the recursive algorithm from Section F.1 is implemented in
its original form, and $t$ is allowed to become sufficiently large, the dynamic range required
in the computation of $\alpha_t(\cdot)$ cannot be accommodated by the finite-length registers
in a typical computer, even if the computation is performed using double-precision arithmetic.
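
The decay described above is easy to reproduce. The following sketch runs the unscaled forward recursion in double precision on a long, deterministic observation and records the first time index at which every forward variable has collapsed to exactly zero. The two-state model, the observation sequence, and the length of 2000 samples are assumptions chosen only to make the effect visible; the time at which underflow sets in depends strongly on the model and the data.

```python
import math


def gaussian_pdf(z, mean, std):
    """Value at z of a Gaussian pdf with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


# Hypothetical two-state model with unit-variance Gaussian outputs.
P = [0.5, 0.5]
Q = [[0.9, 0.1], [0.1, 0.9]]
means = [0.0, 3.0]

# Illustrative observation that simply alternates between the two output means.
N = 2000
z = [means[t % 2] for t in range(N)]

alpha = [P[i] * gaussian_pdf(z[0], means[i], 1.0) for i in range(2)]
first_zero_time = None
for t in range(1, N):
    alpha = [
        sum(alpha[j] * Q[j][i] for j in range(2)) * gaussian_pdf(z[t], means[i], 1.0)
        for i in range(2)
    ]
    # Every output pdf value here is below 0.4, so the sum of the forward
    # variables shrinks by at least that factor at every step.
    if first_zero_time is None and sum(alpha) == 0.0:
        first_zero_time = t

final_sum = sum(alpha)   # exactly 0.0 once underflow has occurred
```

Once the sum reaches zero, the ratio in (F.24) is no longer defined, which is why some form of conditioning is unavoidable for long observations.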


To avoid the problem of eventual register underflow, we must somehow numerically
condition the forward variable so that all computations performed for $t = 0, 1, \ldots, N-1$
occur well within the dynamic range of the computer. We can achieve the desired conditioning
by multiplying the forward variable at each time index by an appropriately chosen
time-dependent scale factor. As we will see, such a scaling procedure essentially creates a
normalized version of the original set of forward variables at each time index, but
nonetheless allows us to compute the new set of scaled forward variables recursively, as we
did originally.


To describe the scaling procedure in detail, we first introduce the notation
$\alpha'_t(\cdot)$ to represent the scaled forward variable at time $t$. The values assumed by
this variable are propagated forward in time using a method similar to that used for the
original variable $\alpha_t(\cdot)$, except that now the recursive step consists of two
distinct parts. In the first part, the scaled variable from the previous time,
$\alpha'_{t-1}(\cdot)$, is projected ahead to time $t$ through the usual inductive formula

$$\alpha''_t(j) = \sum_{i=1}^{L} \alpha'_{t-1}(i)\, Q(i,j)\, h_j(z_t), \quad j = 1, 2, \ldots, L, \tag{F.29}$$

where the new variable $\alpha''_t(\cdot)$ has been introduced to represent the preliminary,
unscaled result of this transformation. In the second part of the recursive step, the unscaled
variable $\alpha''_t(\cdot)$ is transformed back into its properly scaled counterpart according
to the formula

$$\alpha'_t(i) = c_t\, \alpha''_t(i), \quad i = 1, 2, \ldots, L, \tag{F.30}$$

where $c_t$ is the normalization factor given by

$$c_t = \frac{1}{\sum_{j=1}^{L} \alpha''_t(j)}. \tag{F.31}$$

Note from (F.30) that at time $t$, the same scale factor $c_t$ is applied to each of the $L$
terms $\{\alpha''_t(i)\}_{i=1}^{L}$; thus, although the scale factor does indeed vary with the
time index $t$, it is completely independent of the state index $i$. The recursive procedure
for the scaled forward variable is initialized by the equations

\begin{align}
\alpha''_0(i) &= \alpha_0(i), \quad i = 1, 2, \ldots, L \tag{F.32} \\
c_0 &= \frac{1}{\sum_{j=1}^{L} \alpha''_0(j)} \tag{F.33} \\
\alpha'_0(i) &= c_0\, \alpha''_0(i), \quad i = 1, 2, \ldots, L. \tag{F.34}
\end{align}

Once this initialization is performed at $t = 0$, the recursion then consists of a repeated
application of the pair of operations given in (F.29) and (F.30) for $t = 1, 2, \ldots, N-1$.
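
A compact way to see the two-part step is to code it directly. The sketch below is one possible rendering of (F.29) through (F.34), with the scale factor taken as the reciprocal of the sum of the projected values; the function name, the zero-based indexing, and the test model are assumptions introduced for illustration.

```python
import math


def gaussian_pdf(z, mean, std):
    """Value at z of a Gaussian pdf with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def scaled_forward(P, Q, h, z):
    """Scaled forward recursion.

    Returns (alpha_scaled, c), where alpha_scaled[t][i] is the scaled forward
    variable and c[t] is the scale factor applied at time t.
    """
    L = len(P)
    alpha_scaled, c = [], []
    for t in range(len(z)):
        if t == 0:
            projected = [P[i] * h[i](z[0]) for i in range(L)]              # (F.32)
        else:
            prev = alpha_scaled[-1]
            projected = [
                sum(prev[i] * Q[i][j] for i in range(L)) * h[j](z[t])      # (F.29)
                for j in range(L)
            ]
        # The division fails if every projected value is zero, i.e. if the
        # sample z[t] has zero pdf value under every state.
        c_t = 1.0 / sum(projected)                                         # (F.31), (F.33)
        c.append(c_t)
        alpha_scaled.append([c_t * a for a in projected])                  # (F.30), (F.34)
    return alpha_scaled, c


# Hypothetical two-state model and a long illustrative observation.
P = [0.5, 0.5]
Q = [[0.9, 0.1], [0.1, 0.9]]
means = [0.0, 3.0]
h = [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 3.0, 1.0)]
z = [means[t % 2] for t in range(2000)]

alpha_scaled, c = scaled_forward(P, Q, h, z)
```

Because each scaled set sums to one by construction, the values remain well inside the floating-point range no matter how long the observation is.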

With the new recursion now fully specified, let us attempt to express our scaled forward
variable $\alpha'_t(\cdot)$ in terms of the original forward variable $\alpha_t(\cdot)$ at an
arbitrary time $t$. We claim that the relationship between these two variables is given by

$$\alpha'_t(i) = \left[ \prod_{s=0}^{t} c_s \right] \alpha_t(i). \tag{F.35}$$

From the initialization formulas in (F.34), it is easy to verify that this relationship holds
at time 0. Proceeding inductively, then, let us assume that it also holds for each time up to
and including time $t-1$. Then using (F.29) and (F.30), in addition to the original recursive
formula for $\alpha_t(\cdot)$, at time $t$ we may write

\begin{align}
\alpha'_t(j) &= c_t\, \alpha''_t(j) \tag{F.36} \\
&= c_t \sum_{i=1}^{L} \alpha'_{t-1}(i)\, Q(i,j)\, h_j(z_t) \tag{F.37} \\
&= c_t \sum_{i=1}^{L} \left[ \prod_{s=0}^{t-1} c_s \right] \alpha_{t-1}(i)\, Q(i,j)\, h_j(z_t) \tag{F.38} \\
&= \left[ \prod_{s=0}^{t} c_s \right] \sum_{i=1}^{L} \alpha_{t-1}(i)\, Q(i,j)\, h_j(z_t) \tag{F.39} \\
&= \left[ \prod_{s=0}^{t} c_s \right] \alpha_t(j), \tag{F.40}
\end{align}

which proves the claim. Now, by substituting (F.35) into the recursive formula for
$\alpha'_t(\cdot)$, we can derive a more direct expression of the relationship between the
scaled and original forward variables that does not involve the scale factors $\{c_s\}$. In
particular, we have

\begin{align}
\alpha'_t(j) &= \frac{\alpha''_t(j)}{\sum_{k=1}^{L} \alpha''_t(k)} \tag{F.41} \\
&= \frac{\sum_{i=1}^{L} \alpha'_{t-1}(i)\, Q(i,j)\, h_j(z_t)}
        {\sum_{k=1}^{L} \sum_{i=1}^{L} \alpha'_{t-1}(i)\, Q(i,k)\, h_k(z_t)} \tag{F.42} \\
&= \frac{\sum_{i=1}^{L} \left[ \prod_{s=0}^{t-1} c_s \right] \alpha_{t-1}(i)\, Q(i,j)\, h_j(z_t)}
        {\sum_{k=1}^{L} \sum_{i=1}^{L} \left[ \prod_{s=0}^{t-1} c_s \right] \alpha_{t-1}(i)\, Q(i,k)\, h_k(z_t)} \tag{F.43} \\
&= \frac{\alpha_t(j)}{\sum_{k=1}^{L} \alpha_t(k)}. \tag{F.44}
\end{align}

From this last equation, we see that the new scaled forward variable $\alpha'_t(j)$ is truly
just a normalized version of the original forward variable $\alpha_t(j)$ at each time $t$, even
though this may not be immediately evident from the definition of the new variable.

Thus far, we have discussed only the scaling procedure that applies to the forward
recursion. A similar scaling procedure must also be employed during the computation
of the backward variable $\beta_t(\cdot)$, since this variable also exhibits an exponential
decay in magnitude when the original recursive formula is used. However, we need not calculate
a new set of scale factors for the new backward variable; instead, we can simply re-use the
ones that were generated at the corresponding times during the forward recursion. Although
this method does not guarantee that the scale factor applied at time $t$ will restore the sum
of the backward variables to unity at time $t$, it nonetheless yields accurate final results,
since the magnitudes of the forward and backward variables are comparable. Moreover,
this strategy of using common scale factors for the two sets of variables has the obvious
advantage of reducing the overall computational expense associated with the HMM-based
estimation algorithm.

The new backward recursion is defined in a similar manner to the forward recursion
described above. In particular, once again we use the two basic formulas

$$\beta''_t(j) = \sum_{i=1}^{L} \beta'_{t+1}(i)\, Q(j,i)\, h_i(z_{t+1}), \quad j = 1, 2, \ldots, L \tag{F.45}$$


and

$$\beta'_t(i) = c_t\, \beta''_t(i), \quad i = 1, 2, \ldots, L, \tag{F.46}$$

where the first formula is used to project the scaled variable values back from time $t+1$ to
time $t$, and the second is used to adjust the magnitude of these projected values. This new
recursion is initialized by the equation

$$\beta'_{N-1}(i) = c_{N-1}\, \beta_{N-1}(i), \quad i = 1, 2, \ldots, L. \tag{F.47}$$

After this initialization is performed at $t = N-1$, the recursion then consists of a repeated
application of the pair of operations given in (F.45) and (F.46) for $t = N-2, N-3, \ldots, 0$.

By using an inductive argument analogous to that used for the forward variable, we
can show (although the proof will be omitted) that the scaled backward variable can be
expressed in terms of the original backward variable at time $t$ as

$$\beta'_t(i) = \left[ \prod_{s=t}^{N-1} c_s \right] \beta_t(i). \tag{F.48}$$

Furthermore, by combining this identity with its counterpart from the forward recursion
(given in (F.35)), we find that the product of a scaled forward variable and a scaled backward
variable, taken at time $t$ and for state $i$, is given by

\begin{align}
\alpha'_t(i)\, \beta'_t(i)
&= \left[ \prod_{s=0}^{t} c_s \right] \alpha_t(i) \left[ \prod_{s=t}^{N-1} c_s \right] \beta_t(i) \tag{F.49} \\
&= c_t \left[ \prod_{s=0}^{N-1} c_s \right] \alpha_t(i)\, \beta_t(i). \tag{F.50}
\end{align}

The factor multiplying $\alpha_t(i) \beta_t(i)$ in (F.50) depends on the time index $t$ but not
on the state index $i$. This is a significant equation from the standpoint of calculating the
posterior state probability $\gamma_t(i)$, for we can immediately use it to write

\begin{align}
\gamma_t(i) &= \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{j=1}^{L} \alpha_t(j)\, \beta_t(j)} \tag{F.51} \\
&= \frac{c_t \left[ \prod_{s=0}^{N-1} c_s \right] \alpha_t(i)\, \beta_t(i)}
        {\sum_{j=1}^{L} c_t \left[ \prod_{s=0}^{N-1} c_s \right] \alpha_t(j)\, \beta_t(j)} \tag{F.52} \\
&= \frac{\alpha'_t(i)\, \beta'_t(i)}{\sum_{j=1}^{L} \alpha'_t(j)\, \beta'_t(j)}. \tag{F.53}
\end{align}

This last equation implies that we can operate on the new scaled forward and backward
variables in exactly the same way that we operated on the original variables when computing
the value of $\gamma_t(i)$, a critical quantity in the HMM-based estimation algorithm. This is
quite convenient, for it means that we will not incur a premium in computational cost to
recover certain key quantities needed for signal estimation, despite the fact that the
underlying recursive procedure has been modified significantly by the numerical conditioning
strategy.

Another desirable property of our new scaling procedure is that the coefficients $\{c_t\}$
generated during the forward recursion can be used to compute the log-likelihood value
$\log f_{Z_{0:N-1}}(z_{0:N-1})$, which is very useful in problems such as signal detection and
signal classification (see, for example, Section 5.6.1.2). If the scaling procedure were not
required at all (i.e., if all computations from the original recursions could somehow be
performed with infinite precision arithmetic), then this log-likelihood value could be
computed as

\begin{align}
\log f_{Z_{0:N-1}}(z_{0:N-1})
&= \log \sum_{i=1}^{L} f_{Z_{0:N-1},\Theta_{N-1}}(z_{0:N-1}, \Theta_{N-1} = i) \tag{F.54} \\
&= \log \sum_{i=1}^{L} \alpha_{N-1}(i), \tag{F.55}
\end{align}

where in the latter equation we have merely exploited the definition of the original forward
variable at time $N-1$. As we know from our earlier discussion, the summation appearing
in this expression cannot be evaluated directly. However, we can still obtain the desired
log-likelihood value by using the fact that

$$\sum_{i=1}^{L} \alpha_{N-1}(i) = \frac{1}{\prod_{t=0}^{N-1} c_t}, \tag{F.56}$$

which follows easily from the previously derived identities (F.35) and (F.44). Of course,
this last equation can be rewritten as

$$f_{Z_{0:N-1}}(z_{0:N-1}) = \frac{1}{\prod_{t=0}^{N-1} c_t}, \tag{F.57}$$

or equivalently, after taking logarithms of both sides, as

$$\log f_{Z_{0:N-1}}(z_{0:N-1}) = -\sum_{t=0}^{N-1} \log c_t. \tag{F.58}$$

This demonstrates that the scale factors by themselves constitute an extremely valuable
by-product of the new forward recursion, since they can be used to compute the log-likelihood
value of the observation.
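
The complete conditioned procedure can be sketched as a single routine that returns both the posterior state probabilities and the log-likelihood. This is a minimal illustration under stated assumptions rather than a definitive implementation: the function name, the zero-based indexing, and the two-state Gaussian-output model are hypothetical, and the backward pass simply re-uses the forward scale factors as described in this section.

```python
import math


def gaussian_pdf(z, mean, std):
    """Value at z of a Gaussian pdf with the given mean and standard deviation."""
    return math.exp(-0.5 * ((z - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def scaled_forward_backward(P, Q, h, z):
    """Return (gamma, log_likelihood) computed with scaled recursions."""
    L, N = len(P), len(z)

    # Scaled forward pass; c[t] is the reciprocal of the sum of projected values.
    alpha_s, c = [], []
    for t in range(N):
        if t == 0:
            raw = [P[i] * h[i](z[0]) for i in range(L)]
        else:
            raw = [sum(alpha_s[t - 1][i] * Q[i][j] for i in range(L)) * h[j](z[t])
                   for j in range(L)]
        c.append(1.0 / sum(raw))
        alpha_s.append([c[t] * a for a in raw])

    # Scaled backward pass, re-using the forward scale factors.
    beta_s = [None] * N
    beta_s[N - 1] = [c[N - 1]] * L                 # (F.47) with beta_{N-1}(i) = 1
    for t in range(N - 2, -1, -1):
        raw = [sum(beta_s[t + 1][i] * Q[j][i] * h[i](z[t + 1]) for i in range(L))
               for j in range(L)]                  # (F.45)
        beta_s[t] = [c[t] * b for b in raw]        # (F.46)

    # Posterior state probabilities from the scaled variables, (F.53).
    gamma = []
    for t in range(N):
        products = [alpha_s[t][i] * beta_s[t][i] for i in range(L)]
        total = sum(products)
        gamma.append([p / total for p in products])

    log_likelihood = -sum(math.log(ct) for ct in c)    # (F.58)
    return gamma, log_likelihood


# Hypothetical two-state model; every number below is illustrative only.
P = [0.6, 0.4]
Q = [[0.9, 0.1], [0.2, 0.8]]
means = [0.0, 3.0]
h = [lambda x: gaussian_pdf(x, 0.0, 1.0), lambda x: gaussian_pdf(x, 3.0, 1.0)]

short_z = [0.1, 2.9, 3.2]
short_gamma, short_loglik = scaled_forward_backward(P, Q, h, short_z)

long_z = [means[t % 2] for t in range(2000)]
long_gamma, long_loglik = scaled_forward_backward(P, Q, h, long_z)
```

On the short sequence the results can be compared against direct enumeration of all state paths; on the long sequence the unscaled recursions would underflow, whereas the scaled version returns a finite log-likelihood.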


Appendix  G 

Using  a  Gaussian  Mixture  to 
Approximate  a  Nonlinear 
Estimator 


In this appendix, we demonstrate that using a low-order Gaussian-mixture approximation
in place of the true signal pdf can lead to the development of a nearly optimal,
computationally efficient estimation scheme. We examine a simple but illustrative non-Gaussian
signal  estimation  problem.  In  particular,  we  consider  a  problem  in  which  we  are  allowed  to 
observe  only  the  sum  of  two  statistically  independent,  scalar-valued  random  variables,  each 
having  a  pdf  that  is  precisely  known.  One  of  these  random  variables  is  assumed  to  have 
a  non-Gaussian  pdf  and  is  designated  as  a  signal  variable;  the  other  is  assumed  to  have  a 
Gaussian  pdf  and  is  designated  as  a  noise  variable.  On  the  basis  of  our  single  observation, 
we  wish  to  generate  an  MMSE  estimate  of  the  value  assumed  by  the  signal  variable. 

We begin our discussion with a mathematical description of the elements of the estimation
problem,  and  we  then  immediately  introduce  a  Gaussian-mixture  approximation  for  the  non- 
Gaussian  signal  pdf.  Using  this  pdf  approximation,  we  in  turn  develop  an  approximation 
for  the  true  globally  optimal  processor  that  should  be  applied  to  the  observation.  We  are 
able  to  decompose  the  approximate  processor  into  a  collection  of  linear  terms,  and  we  can 
therefore  easily  make  predictions  concerning  how  the  optimal  processor  behaves  when  the 
observation  lies  in  various  ranges.  We  show  that  using  such  an  approximation  can  provide  a 
deeper,  more  intuitive  understanding  of  the  structure  of  the  optimal  estimator.  Finally,  we 
derive  the  exact  mathematical  form  for  the  optimal  processor  and  show  that  this  estimator 
does  indeed  behave  as  we  had  predicted. 


G.1  Observation  Model  and  Problem  Statement

Let $Y$ and $V$ be statistically independent random variables representing, respectively, the
signal and noise components of our scalar-valued observation $Z$, which is defined in the
usual way as

$$Z = Y + V. \tag{G.1}$$

Suppose further that the signal variable $Y$ is characterized by a zero-mean Laplacian pdf,
defined by

$$f_Y(y) = \frac{1}{2\beta} \exp\left\{ -\frac{|y|}{\beta} \right\}, \quad -\infty < y < \infty, \tag{G.2}$$

and that the noise variable $V$ has a zero-mean Gaussian pdf, which is given by

$$f_V(v) = \frac{1}{\sqrt{2\pi}\, \sigma_V} \exp\left\{ -\frac{v^2}{2\sigma_V^2} \right\}, \quad -\infty < v < \infty. \tag{G.3}$$


For the sake of concreteness, we assume throughout the following discussion that the pdf
scale parameters associated with the signal and noise variables (i.e., $\beta$ and $\sigma_V$,
respectively) take the values $\beta = 2$ and $\sigma_V = 3$. (Our choice for the value of
$\beta$ implies that the standard deviation for the signal is given by
$\sigma_Y = \sqrt{2}\, \beta \approx 2.83$; hence, the signal-to-noise ratio for this problem
is approximately 0 dB.)
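
The quoted standard deviation and signal-to-noise ratio can be checked with a few lines of arithmetic. The sketch below assumes the Laplacian pdf has the usual form $\exp(-|y|/\beta) / (2\beta)$, which is consistent with the stated standard deviation of $\sqrt{2}\,\beta$; the variable names and the simple trapezoidal integration are illustrative choices.

```python
import math

beta = 2.0       # Laplacian scale parameter of the signal
sigma_v = 3.0    # standard deviation of the Gaussian noise

# Closed-form standard deviation of the Laplacian signal and the resulting SNR.
sigma_y = math.sqrt(2.0) * beta
snr_db = 20.0 * math.log10(sigma_y / sigma_v)


def laplacian_pdf(y):
    """Assumed form of the zero-mean Laplacian pdf with scale parameter beta."""
    return math.exp(-abs(y) / beta) / (2.0 * beta)


# Numerical cross-check of the variance: integrate y^2 f_Y(y) over [0, 100]
# with the trapezoidal rule and double the result by symmetry.
step = 0.001
n = 100_000
total = 0.5 * (0.0 + (n * step) ** 2 * laplacian_pdf(n * step))
for k in range(1, n):
    y = k * step
    total += y * y * laplacian_pdf(y)
variance_numeric = 2.0 * total * step
```

With these values the ratio of standard deviations is slightly below one, so the signal-to-noise ratio comes out about half a decibel below 0 dB.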

Ultimately, we would like to determine the explicit form for the function
$\hat{y}(z) = E\{Y \mid Z = z\}$, which is known to yield the globally optimal estimate (in the
MMSE sense) of the signal value $y$ based on our observation of the event $Z = z$. However,
the fact that the signal variable $Y$ is Laplacian makes the estimation problem considerably
more complicated than it would be if $Y$ were Gaussian. As we will soon discover (following a
rather involved derivation), the optimal data processor for this problem is indeed nonlinear
and has a complex mathematical structure. Before delving into the details of this derivation,
however, let us first try to predict the basic form for this optimal processor by reasoning
(in an approximate sense) about what action this processor should perform over various ranges
of the input value $z$.


G.2  Using  the  Gaussian-Mixture  Approximation 

We  can  develop  our  intuitive  understanding  of  the  optimal  data  processor  by  introducing 
an  approximation  for  the  Laplacian  signal  pdf  itself.  Recall  from  the  example  we  presented 
in  Chapter  2  —  specifically,  the  example  from  our  discussion  on  the  source  identification 
problem  involving  Laplacian-distributed  driving  noise  for  an  AR  process  —  that  we  have 
already  obtained  such  an  approximation  in  the  form  of  a  three-component  Gaussian-mixture 
pdf.  In  fact,  numerous  approximations  of  this  kind  were  generated  in  that  example,  one 
for  each  of  the  experimental  trials  that  was  performed.  For  the  purposes  of  the  present 
example,  we  have  arbitrarily  selected  one  of  these  approximations,  specified  by  the  collection 
of parameter values

\begin{align}
(\mu_1, \sigma_1, \rho_1) &= (0.0, 0.87, 0.28) \tag{G.4} \\
(\mu_2, \sigma_2, \rho_2) &= (0.0, 2.76, 0.61) \tag{G.5} \\
(\mu_3, \sigma_3, \rho_3) &= (0.0, 5.47, 0.11). \tag{G.6}
\end{align}

In Figures G-1(a) and G-1(b), we show, respectively, a plot of the original Laplacian density
and the three individual Gaussian densities that make up the Gaussian-mixture approximation.
(The densities shown in Figure G-1(b) were scaled in such a way that they could be
conveniently superimposed on a single plot. As depicted in this figure, the scaled densities
appear in approximately the same relative proportions as they would if each original pdf
had been multiplied by its associated weighting coefficient $\rho_i$.)

Figure G-1: (a) True Laplacian density with scale parameter $\beta = 2$; (b) Individual
Gaussian densities making up approximating Gaussian mixture.

Now  that  we  have  a  simple  and  reasonably  accurate  Gaussian-mixture  approximation  for 
the  pdf  of  the  signal  variable,  we  would  like  to  develop  an  optimal  estimator  for  the  signal 
value  based  on  the  assumption  that  our  approximation  is  exact.  Recall  that  we  have  already 
carried  out  such  a  development  in  Chapter  5.  The  key  technique  used  in  that  development 
was  to  condition  the  problem  on  each  of  the  three  possible  choices  for  the  purely  Gaussian 
pdf  that  could  give  rise  to  the  signal  value.  Under  each  condition,  the  optimal  processor 
is  a  Wiener  smoother,  since  the  noise  is  also  Gaussian.  For  this  scalar  case,  implementing 
the  Wiener  smoother  for  each  condition  is  carried  out  by  multiplying  the  measurement  by 
an appropriately chosen positive coefficient. The overall approximate processor is therefore
given by

$$\hat{y}(z) = \sum_{i=1}^{3} \Pr\{\Phi = i \mid Z = z\}\, \frac{\sigma_i^2}{\sigma_i^2 + \sigma_V^2}\, z. \tag{G.7}$$

Although the basic components of this processor are linear terms, the overall processor itself
is clearly nonlinear, since the posterior probability $\Pr\{\Phi = i \mid Z = z\}$ multiplying
the $i$th linear processor is actually a nonlinear function of the observed data $z$.

All  three  linear  processors  included  in  (G.7)  are  plotted  in  Figure  G-2.  Based  on  the 
form of (G.7), we see that the slope of each line is determined by the relationship between
the standard deviation of its associated signal component, $\sigma_i$, and the standard
deviation of the noise, $\sigma_V$.¹ Note that each line shown in the figure consists of a
dotted portion and
a  solid  portion.  Each  line  appears  solid  throughout  the  region  where  the  corresponding 
Gaussian  component  of  the  signal  pdf  has  non-negligible  posterior  probability;  hence,  in 
regions  where  the  line  is  dotted,  the  effect  of  the  linear  processor  can  essentially  be  ignored. 
This  suggests  that  the  overall  data  processor  operates  in  three  separate  regimes,  depending 
on  whether  the  magnitude  of  the  observation  z  is  small,  intermediate,  or  large.  These 
regimes  are  labeled  in  Figure  G-2  as  Regimes  1,  2,  and  3,  respectively. 

Let  us  now  try  to  predict  what  the  processor  will  do  in  each  of  its  three  regimes.  In 
Regime  1,  while  it  is  true  that  any  of  the  three  Gaussian  components  of  the  signal  pdf  could 
have  produced  the  signal  value,  clearly  the  first  two  components  (i.e.,  those  with  the  smallest 
standard  deviations)  account  for  most  of  the  probability.  We  expect  that  the  processor  will 
be  approximately  linear  throughout  this  regime  and  will  be  situated  somewhere  between  the 
two lines of smallest slope depicted in the figure. On the other hand, in Regime 3, it is very
unlikely that any but the third Gaussian component of the signal pdf (i.e., that with the
largest standard deviation) could have produced the signal value. In this case the processor
will also be approximately linear, but it will now virtually coincide with the line of largest
slope.  Finally,  in  Regime  2,  the  processor  will  once  again  be  approximately  linear,  but  it 
will  be  structured  in  such  a  way  that  it  connects  the  other  linear  functions  from  Regimes  1 
and  3.  In  fact,  as  we  can  see  from  Figure  G-3,  this  is  precisely  how  the  nonlinear  processor 
given  in  (G.7)  appears  when  plotted. 
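
The three-regime behavior can be checked numerically by evaluating the approximate processor directly. The sketch below uses the mixture parameters from (G.4) through (G.6) and $\sigma_V = 3$. It assumes that the posterior component probabilities follow from Bayes' rule, with the observation under component $i$ being zero-mean Gaussian with variance $\sigma_i^2 + \sigma_V^2$; the function names are hypothetical.

```python
import math

SIGMA_V = 3.0
# (standard deviation, weight) of each zero-mean mixture component, from (G.4)-(G.6).
COMPONENTS = [(0.87, 0.28), (2.76, 0.61), (5.47, 0.11)]


def zero_mean_normal_pdf(x, std):
    """Value at x of a zero-mean Gaussian pdf with the given standard deviation."""
    return math.exp(-0.5 * (x / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


def slopes():
    """Slope of the linear (scalar Wiener) processor associated with each component."""
    return [s * s / (s * s + SIGMA_V ** 2) for s, _ in COMPONENTS]


def posterior_weights(z):
    """Posterior probability of each component given Z = z (Bayes' rule, assumed form).

    For very large |z| all three pdf values underflow to zero and the
    normalization below would fail; moderate values of z are assumed.
    """
    unnormalized = [
        w * zero_mean_normal_pdf(z, math.sqrt(s * s + SIGMA_V ** 2))
        for s, w in COMPONENTS
    ]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def approximate_estimate(z):
    """Posterior-weighted sum of the three linear processors."""
    return sum(p * k * z for p, k in zip(posterior_weights(z), slopes()))


small_z_slope = approximate_estimate(0.1) / 0.1     # effective slope in Regime 1
large_z_slope = approximate_estimate(30.0) / 30.0   # effective slope in Regime 3
```

For a small observation the effective slope falls between the two smallest component slopes, and for a large observation it is essentially the largest component slope, in line with the prediction above.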


Figure G-2: Optimal linear processors associated with the three Gaussian densities in
approximating mixture. Solid portion of each line indicates region where the corresponding
density has non-negligible probability.

¹In general, the slope of each line can range from zero to unity. If $\sigma_i$ is extremely
small relative to $\sigma_V$, then the slope of the corresponding line will be near its
minimum value of zero; this represents the case in which there is almost no information to be
gained from a single observation of the corrupted signal value. As the value of $\sigma_i$
increases relative to $\sigma_V$, the corresponding line becomes steeper, and hence more
weight is given to the observation. Finally, if $\sigma_i$ is extremely large relative to
$\sigma_V$, the slope of the corresponding line approaches its maximum value of unity; in this
case, the observation is rich in information and, accordingly, the optimal data processor is
approximately the identity function.


G.3  Derivation  of  Globally  Optimal  Processor

To assess the accuracy of our prediction, let us now derive the functional form of the actual
optimal processor $E\{Y \mid Z = z\}$ for this non-Gaussian problem. We begin by going back to
the definition of conditional mean, from which we can write

\begin{align}
E\{Y \mid Z = z\} &= \int_{-\infty}^{\infty} y\, f_{Y|Z}(y \mid Z = z)\, dy \tag{G.8} \\
&= \frac{\int_{-\infty}^{\infty} y\, f_{Y,Z}(y, z)\, dy}{\int_{-\infty}^{\infty} f_{Y,Z}(y, z)\, dy}, \tag{G.9}
\end{align}

where in the latter step we have used the fact that the conditional pdf of $Y$, given that
$Z = z$, is simply a normalized version of the joint pdf of $Y$ and $Z$ when this joint pdf is
viewed as a function of $y$ alone (i.e., when $z$ is assumed fixed). From the definitions given
in (G.2) and (G.3), we have that


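\begin{align}
f_{Y,Z}(y, z) &= f_Y(y)\, f_{Z|Y}(z \mid Y = y) \tag{G.10} \\
&= f_Y(y)\, f_V(z - y) \nonumber \\
&= \left[ \frac{1}{2\beta} \exp\left( -\frac{|y|}{\beta} \right) \right] \left[ \frac{1}{\sqrt{2\pi\sigma_V^2}} \exp\left( -\frac{(z - y)^2}{2\sigma_V^2} \right) \right]. \tag{G.11}
\end{align}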
If we now substitute this expression into (G.9) and then cancel like terms from the numerator
and denominator, we obtain the equivalent estimation formula


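\begin{equation}
E\{Y \mid Z = z\} = \frac{\int_{-\infty}^{\infty} y \exp\left( -\frac{y^2}{2\sigma_V^2} + \frac{z y}{\sigma_V^2} - \frac{|y|}{\beta} \right) dy}{\int_{-\infty}^{\infty} \exp\left( -\frac{y^2}{2\sigma_V^2} + \frac{z y}{\sigma_V^2} - \frac{|y|}{\beta} \right) dy}. \tag{G.14}
\end{equation}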
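As a sanity check on the ratio of integrals above, the conditional mean can be evaluated by brute-force quadrature. The following is only a minimal sketch under stated assumptions: the function name `conditional_mean_numeric`, the truncation interval, and the grid size are illustrative choices, and the default noise level `sigma_v=3.0` is an assumed value (the Laplacian scale `beta=2.0` follows from $\sigma_Y = \sqrt{2}\beta = 2\sqrt{2}$). The integrand keeps the factor $\exp(-z^2/2\sigma_V^2)$ that cancels between numerator and denominator, which does not change the ratio but keeps the exponent well scaled.

```python
import math

def conditional_mean_numeric(z, beta=2.0, sigma_v=3.0, half_width=60.0, n=24001):
    """Approximate E{Y | Z = z} by trapezoidal quadrature of the ratio in (G.14).

    beta    : scale of the Laplacian signal pdf (assumed value)
    sigma_v : standard deviation of the Gaussian noise (assumed value)
    The infinite integration range is truncated to [-half_width, half_width].
    """
    step = 2.0 * half_width / (n - 1)
    num = 0.0
    den = 0.0
    for k in range(n):
        y = -half_width + k * step
        # Unnormalized joint pdf of (Y, Z) viewed as a function of y alone.
        weight = math.exp(-abs(y) / beta - (z - y) ** 2 / (2.0 * sigma_v ** 2))
        if k == 0 or k == n - 1:
            weight *= 0.5  # trapezoidal end-point weights
        num += y * weight
        den += weight
    return num / den
```

Because the signal pdf is symmetric about zero, the result should be an odd function of z that vanishes at z = 0 and shrinks the observation toward zero.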


Now,  by  selecting  appropriate  limits  of  integration  and  introducing  a  simple  change  of 
variables, we can rewrite the above expression so that terms of the form |y| never appear.
The  new  formula  is  given  by 


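\begin{equation}
E\{Y \mid Z = z\} = \frac{\displaystyle \int_0^{\infty} y \exp\left[ -\frac{y^2}{2\sigma_V^2} - \left( \frac{\sigma_V^2 - \beta z}{\beta\sigma_V^2} \right) y \right] dy - \int_0^{\infty} y \exp\left[ -\frac{y^2}{2\sigma_V^2} - \left( \frac{\sigma_V^2 + \beta z}{\beta\sigma_V^2} \right) y \right] dy}{\displaystyle \int_0^{\infty} \exp\left[ -\frac{y^2}{2\sigma_V^2} - \left( \frac{\sigma_V^2 - \beta z}{\beta\sigma_V^2} \right) y \right] dy + \int_0^{\infty} \exp\left[ -\frac{y^2}{2\sigma_V^2} - \left( \frac{\sigma_V^2 + \beta z}{\beta\sigma_V^2} \right) y \right] dy}. \tag{G.15}
\end{equation}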

Each  integral  appearing  in  this  expression  can  easily  be  evaluated  using  standard  integral 
tables  [69].  From  the  tables,  we  have,  for  given  constants  A  and  B,  that 

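\begin{align}
\int_0^{\infty} y \exp\left( -\frac{y^2}{4A} - B y \right) dy &= 2A - 2AB\sqrt{\pi A} \cdot \exp(AB^2) \cdot \mathrm{erfc}(\sqrt{A}\,B) \tag{G.16} \\
\int_0^{\infty} \exp\left( -\frac{y^2}{4A} - B y \right) dy &= \sqrt{\pi A} \cdot \exp(AB^2) \cdot \mathrm{erfc}(\sqrt{A}\,B), \tag{G.17}
\end{align}

where erfc(·) is the complementary error function given by

\begin{equation}
\mathrm{erfc}(x) = \frac{2}{\sqrt{\pi}} \int_x^{\infty} e^{-t^2}\, dt. \tag{G.18}
\end{equation}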
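The two table entries (G.16) and (G.17) are easy to spot-check numerically. The sketch below compares each closed form with a trapezoidal approximation of the corresponding integral; the helper names (`table_integral_y`, `table_integral_1`, `trapezoid`) and the truncation of the upper limit are illustrative assumptions, not part of the original derivation.

```python
import math

def table_integral_y(A, B):
    """Closed form (G.16) for the integral of y*exp(-y^2/(4A) - B*y) over [0, inf)."""
    return 2.0 * A - 2.0 * A * B * math.sqrt(math.pi * A) * math.exp(A * B * B) * math.erfc(math.sqrt(A) * B)

def table_integral_1(A, B):
    """Closed form (G.17) for the integral of exp(-y^2/(4A) - B*y) over [0, inf)."""
    return math.sqrt(math.pi * A) * math.exp(A * B * B) * math.erfc(math.sqrt(A) * B)

def trapezoid(f, upper, n):
    """Composite trapezoidal rule on [0, upper] with n subintervals."""
    step = upper / n
    total = 0.5 * (f(0.0) + f(upper))
    for k in range(1, n):
        total += f(k * step)
    return total * step

def check(A, B, upper=60.0, n=60000):
    """Return (error in G.16, error in G.17) for the given constants."""
    kernel = lambda y: math.exp(-y * y / (4.0 * A) - B * y)
    err_y = abs(trapezoid(lambda y: y * kernel(y), upper, n) - table_integral_y(A, B))
    err_1 = abs(trapezoid(kernel, upper, n) - table_integral_1(A, B))
    return err_y, err_1
```

For the integrals in (G.15), the constants are A = σ_V²/2 and B = (σ_V² ∓ βz)/(βσ_V²), so a call such as `check(4.5, 7.0 / 18.0)` would correspond to the assumed values σ_V = 3, β = 2, and z = 1.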

For each of the four definite integrals appearing in (G.15), the constants A and B are
easy to identify. Upon substituting the integral values back into (G.15), we find, after a
considerable amount of algebraic manipulation, that the optimal data processor in this case
reduces to


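\begin{align}
E\{Y \mid Z = z\} &= \frac{(z - \sigma_V^2/\beta)\, G(z) + (z + \sigma_V^2/\beta)\, H(z)}{G(z) + H(z)} \tag{G.19} \\
&= \frac{z\,(G(z) + H(z)) - (\sigma_V^2/\beta)(G(z) - H(z))}{G(z) + H(z)} \tag{G.20} \\
&= z - \frac{\sigma_V^2}{\beta} \left[ \frac{G(z) - H(z)}{G(z) + H(z)} \right], \tag{G.21}
\end{align}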
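where G(·) and H(·) are the rather complicated nonlinear functions defined by

\begin{equation}
G(z) = \exp\left[ \left( \frac{\sigma_V^2 - \beta z}{\sqrt{2}\,\beta\sigma_V} \right)^2 \right] \mathrm{erfc}\left( \frac{\sigma_V^2 - \beta z}{\sqrt{2}\,\beta\sigma_V} \right) \tag{G.22}
\end{equation}

and

\begin{equation}
H(z) = \exp\left[ \left( \frac{\sigma_V^2 + \beta z}{\sqrt{2}\,\beta\sigma_V} \right)^2 \right] \mathrm{erfc}\left( \frac{\sigma_V^2 + \beta z}{\sqrt{2}\,\beta\sigma_V} \right). \tag{G.23}
\end{equation}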


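It can be proven without too much difficulty (although the details will be omitted here)
that the following three limits hold for the ratio [G(z) − H(z)]/[G(z) + H(z)]:

\begin{align}
\lim_{z \to -\infty} \frac{G(z) - H(z)}{G(z) + H(z)} &= -1 \tag{G.24} \\
\lim_{z \to 0} \frac{G(z) - H(z)}{G(z) + H(z)} &= 0 \tag{G.25} \\
\lim_{z \to +\infty} \frac{G(z) - H(z)}{G(z) + H(z)} &= +1. \tag{G.26}
\end{align}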

From these three facts alone, we can begin to envision the form that the optimal processor
must possess. Specifically, using (G.21), we have that the optimal processor is zero at z = 0,
approaches the positively biased linear function $z + (\sigma_V^2/\beta)$ as $z \to -\infty$, and approaches the
negatively biased linear function $z - (\sigma_V^2/\beta)$ as $z \to +\infty$. A plot of the true optimal processor
is shown as the solid curve in Figure G-4. Also shown in this figure is our approximation
to the optimal processor that results from using our three-component Gaussian mixture
specified in (G.6).
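The closed-form processor and its limiting behavior can be illustrated with a few lines of code. This is a hedged sketch rather than a reference implementation: the function names are hypothetical, the default `sigma_v=3.0` is an assumed noise level, and the direct product exp(a²)·erfc(a) is only safe for moderate |z| (roughly |z| below 100 for these parameter values), since exp(a²) overflows beyond that.

```python
import math

def _scaled_erfc(a):
    """exp(a^2) * erfc(a), the common building block of G(z) and H(z)."""
    return math.exp(a * a) * math.erfc(a)

def g_fun(z, beta, sigma_v):
    """G(z) as in (G.22)."""
    return _scaled_erfc((sigma_v ** 2 - beta * z) / (math.sqrt(2.0) * beta * sigma_v))

def h_fun(z, beta, sigma_v):
    """H(z) as in (G.23); same form as G(z) with the sign of beta*z reversed."""
    return _scaled_erfc((sigma_v ** 2 + beta * z) / (math.sqrt(2.0) * beta * sigma_v))

def optimal_processor(z, beta=2.0, sigma_v=3.0):
    """E{Y | Z = z} from (G.21); beta and sigma_v defaults are assumed values."""
    g = g_fun(z, beta, sigma_v)
    h = h_fun(z, beta, sigma_v)
    return z - (sigma_v ** 2 / beta) * (g - h) / (g + h)

def bias_ratio(z, beta=2.0, sigma_v=3.0):
    """The ratio [G(z) - H(z)] / [G(z) + H(z)] whose limits are (G.24)-(G.26)."""
    g = g_fun(z, beta, sigma_v)
    h = h_fun(z, beta, sigma_v)
    return (g - h) / (g + h)
```

Evaluating `bias_ratio` at z = -40, 0, and 40 should give values very close to -1, 0, and +1, so that `optimal_processor` tracks z + σ_V²/β and z − σ_V²/β in the two tails.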

Because the optimal processor and its approximation are so close in this example, they
yield nearly identical mean squared error values of about 4.07. Since the Laplacian pdf
represents a deviation from a Gaussian pdf that is not very severe, a linear processor would
probably be adequate in this case. A logical choice for a linear estimator would be the
optimal processor associated with a Gaussian signal pdf whose standard deviation is the
same as that of the Laplacian pdf, namely $\sigma_Y = \sqrt{2}\,\beta = 2\sqrt{2}$; applying such a processor in
this example yields a slightly higher MSE value of about 4.24.
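The MSE comparison can be reproduced approximately as follows. Several items here are assumptions rather than facts taken from this appendix: the noise level `SIGMA_V = 3.0` is inferred from the quoted linear MSE (with σ_Y² = 8, the value σ_Y²σ_V²/(σ_Y² + σ_V²) = 72/17 ≈ 4.24 requires σ_V² = 9); the expression used for the pdf of Z follows from (G.11) and (G.17) and is our own intermediate step; and the optimal MSE is computed from the orthogonality identity MSE = E{Y²} − E{(E{Y|Z})²}.

```python
import math

BETA = 2.0      # Laplacian scale, so that sigma_Y = sqrt(2) * BETA = 2 * sqrt(2)
SIGMA_V = 3.0   # ASSUMPTION: noise standard deviation inferred from the quoted linear MSE

def _g_and_h(z):
    """Return (G(z), H(z)) as in (G.22) and (G.23)."""
    scale = math.sqrt(2.0) * BETA * SIGMA_V
    a_g = (SIGMA_V ** 2 - BETA * z) / scale
    a_h = (SIGMA_V ** 2 + BETA * z) / scale
    return math.exp(a_g ** 2) * math.erfc(a_g), math.exp(a_h ** 2) * math.erfc(a_h)

def optimal_processor(z):
    """E{Y | Z = z} from (G.21)."""
    g, h = _g_and_h(z)
    return z - (SIGMA_V ** 2 / BETA) * (g - h) / (g + h)

def pdf_z(z):
    """pdf of Z = Y + V, obtained by integrating (G.11) over y with (G.17)."""
    g, h = _g_and_h(z)
    return math.exp(-z * z / (2.0 * SIGMA_V ** 2)) * (g + h) / (4.0 * BETA)

def integrate(f, lo, hi, n):
    """Composite trapezoidal rule on [lo, hi] with n subintervals."""
    step = (hi - lo) / n
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, n):
        total += f(lo + k * step)
    return total * step

var_y = 2.0 * BETA ** 2  # variance of the Laplacian signal
# MSE of the best linear estimator depends only on second moments.
mse_linear = var_y * SIGMA_V ** 2 / (var_y + SIGMA_V ** 2)
# MSE of the conditional-mean estimator: E{Y^2} - E{g(Z)^2}.
mse_optimal = var_y - integrate(lambda z: optimal_processor(z) ** 2 * pdf_z(z), -60.0, 60.0, 24000)
```

If the assumed noise level matches the one used in this example, `mse_linear` and `mse_optimal` should land near the quoted values of about 4.24 and 4.07, respectively; in any case the optimal value cannot exceed the linear one.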
Figure G-3: Linear data processors (dashed curves) associated with individual components
of the Gaussian mixture and the resulting approximate nonlinear data processor (solid
curve) associated with the overall mixture.


Figure  G-4:  Optimal  processor  (solid  curve)  and  approximate  processor  (dashed  curve)  for 
estimating  a  Laplacian  random  variable  in  additive  Gaussian  noise. 